Dataset statistics
Number of variables | 29 |
---|---|
Number of observations | 2043 |
Missing cells | 24247 |
Missing cells (%) | 40.9% |
Duplicate rows | 0 |
Duplicate rows (%) | 0.0% |
Total size in memory | 6.5 MiB |
Average record size in memory | 3.3 KiB |
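The percentages and averages in this table follow from the raw counts. A minimal sketch of the arithmetic, assuming the missing-cell percentage is taken over variables × observations and that 1 MiB = 1024 KiB (the variable names are illustrative and not part of the report):

```python
# Figures copied from the "Dataset statistics" table.
n_variables = 29
n_observations = 2043
missing_cells = 24247
total_size_mib = 6.5

# Assumption: the percentage is missing cells over all cells (variables x observations).
total_cells = n_variables * n_observations
missing_pct = 100 * missing_cells / total_cells

# Assumption: 1 MiB = 1024 KiB, and the average is total size over observations.
avg_record_kib = total_size_mib * 1024 / n_observations

print(f"total cells: {total_cells}")
print(f"missing cells: {missing_pct:.1f}%")
print(f"average record size: {avg_record_kib:.1f} KiB")
```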
Variable types
Categorical | 24 |
---|---|
Numeric | 3 |
Boolean | 2 |
Alerts
min_length has constant value "8.0" | Constant |
encoder_no_repeat_ngram_size has constant value "4.0" | Constant |
num_beam_groups has constant value "1.0" | Constant |
early_stopping has constant value "True" | Constant |
do_sample has constant value "False" | Constant |
format has constant value "paragraph" | Constant |
penalty_alpha has constant value "0.6" | Constant |
runtime has constant value "45:30" | Constant |
GAUNTLET_PATH has a high cardinality: 2043 distinct values | High cardinality |
summary has a high cardinality: 1960 distinct values | High cardinality |
date has a high cardinality: 96 distinct values | High cardinality |
no_repeat_ngram_size is highly imbalanced (68.2%) | Imbalance |
repetition_penalty is highly imbalanced (80.7%) | Imbalance |
length_penalty is highly imbalanced (85.8%) | Imbalance |
min_length has 324 (15.9%) missing values | Missing |
max_length has 81 (4.0%) missing values | Missing |
no_repeat_ngram_size has 305 (14.9%) missing values | Missing |
encoder_no_repeat_ngram_size has 305 (14.9%) missing values | Missing |
repetition_penalty has 324 (15.9%) missing values | Missing |
num_beams has 233 (11.4%) missing values | Missing |
num_beam_groups has 324 (15.9%) missing values | Missing |
length_penalty has 354 (17.3%) missing values | Missing |
early_stopping has 314 (15.4%) missing values | Missing |
do_sample has 324 (15.9%) missing values | Missing |
date has 243 (11.9%) missing values | Missing |
length has 1962 (96.0%) missing values | Missing |
format has 1962 (96.0%) missing values | Missing |
extractiveness has 1962 (96.0%) missing values | Missing |
temperature has 1962 (96.0%) missing values | Missing |
token_batch_length has 1700 (83.2%) missing values | Missing |
penalty_alpha has 1962 (96.0%) missing values | Missing |
top_k has 1962 (96.0%) missing values | Missing |
batch_stride has 1791 (87.7%) missing values | Missing |
max_len_ratio has 1943 (95.1%) missing values | Missing |
directory-topic-tag has 1943 (95.1%) missing values | Missing |
runtime has 1967 (96.3%) missing values | Missing |
GAUNTLET_PATH is uniformly distributed | Uniform |
summary is uniformly distributed | Uniform |
source_doc_filename is uniformly distributed | Uniform |
source_doc_id is uniformly distributed | Uniform |
GAUNTLET_PATH has unique values | Unique |
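The Imbalance alerts can be cross-checked against the category counts reported in the per-variable sections. The sketch below assumes the score is one minus the Shannon entropy of the category frequencies, normalized by the log of the number of categories. That formula is an assumption, chosen because it lands close to the percentages in the alerts, and `imbalance_score` is a hypothetical helper rather than a pandas-profiling function.

```python
import math


def imbalance_score(counts):
    """Return 1 - normalized Shannon entropy of the category counts (assumed formula)."""
    total = sum(counts)
    n_classes = len(counts)
    if n_classes < 2:
        return 0.0  # a single category has no balance to measure
    entropy = -sum((c / total) * math.log(c / total) for c in counts if c > 0)
    return 1 - entropy / math.log(n_classes)


# Non-missing category counts taken from the Common Values tables.
no_repeat_ngram_size = [1638, 100]  # values 3.0 and 4.0
repetition_penalty = [1668, 51]     # values 2.5 and 1.5

print(f"no_repeat_ngram_size: {imbalance_score(no_repeat_ngram_size):.1%}")
print(f"repetition_penalty:   {imbalance_score(repetition_penalty):.1%}")
```

A perfectly balanced variable scores 0 under this definition, and the score approaches 1 as one category takes over.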
Reproduction
Analysis started | 2023-05-24 07:04:02.234426 |
---|---|
Analysis finished | 2023-05-24 07:04:06.861090 |
Duration | 4.63 seconds |
Software version | pandas-profiling v3.6.6 |
Download configuration | config.json |
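The Duration row can be recomputed from the two timestamps. A short sketch using only the standard library (the format string is an assumption about how the timestamps are written):

```python
from datetime import datetime

# Assumed timestamp layout, matching the two rows in the Reproduction table.
fmt = "%Y-%m-%d %H:%M:%S.%f"
started = datetime.strptime("2023-05-24 07:04:02.234426", fmt)
finished = datetime.strptime("2023-05-24 07:04:06.861090", fmt)

duration_s = (finished - started).total_seconds()
print(f"{duration_s:.2f} seconds")
```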
GAUNTLET_PATH
Categorical
HIGH CARDINALITY
UNIFORM
UNIQUE
Distinct | 2043 |
---|---|
Distinct (%) | 100.0% |
Missing | 0 |
Missing (%) | 0.0% |
Memory size | 345.0 KiB |
SHORT-CONTEXT-MODELS/flan-t5-3b-summarizer/beam-search-8192-nb4/ASR-whisper-rpunctuated_Noam Chomsky, Fundam_1669853561_0_part1_summary.txt | 1 |
---|---|
long-t5/long-t5-base-booksci-summary-v1/bs-16384-nb-8/script_strangersonatrain_summary.txt | 1 |
long-t5/long-t5-base-booksci-summary-v1/bs-8192-nb-16/OCR_PAPER_dall-e-2-annotated__summary.txt | 1 |
long-t5/long-t5-base-booksci-summary-v1/bs-8192-nb-16/OCR_PAPER_Kandpal, Nieto, Jin - 2022 - Music Enhancement via Image Translation and Vocoding-annotated__summary.txt | 1 |
long-t5/long-t5-base-booksci-summary-v1/bs-8192-nb-16/OCR_PAPER_Hong et al. - 2022 - CogVideo Large-scale Pretraining for Text-to-Video Generation via Transformers-annotated__summary.txt | 1 |
Other values (2038) | 2038 |
Length
Max length | 232 |
---|---|
Median length | 183 |
Mean length | 115.86686 |
Min length | 49 |
Characters and Unicode
Total characters | 236716 |
---|---|
Distinct characters | 61 |
Distinct categories | 7 |
Distinct scripts | 2 |
Distinct blocks | 1 |
Unique
Unique | 2043 |
---|---|
Unique (%) | 100.0% |
Sample
1st row | SHORT-CONTEXT-MODELS/flan-t5-3b-summarizer/beam-search-8192-nb4/ASR-whisper-rpunctuated_Noam Chomsky, Fundam_1669853561_0_part1_summary.txt |
---|---|
2nd row | SHORT-CONTEXT-MODELS/flan-t5-3b-summarizer/beam-search-8192-nb4/ASR-whisper-rpunctuated_Noam Chomsky, Fundam_1669853631_0_part2_summary.txt |
3rd row | SHORT-CONTEXT-MODELS/flan-t5-3b-summarizer/beam-search-8192-nb4/ASRnlp_law_lecture_week_1_v_2_c_transcription_1_summary.txt |
4th row | SHORT-CONTEXT-MODELS/flan-t5-3b-summarizer/beam-search-8192-nb4/ASRnlp_law_lecture_week_2_v_2_c_transcription_2_summary.txt |
5th row | SHORT-CONTEXT-MODELS/flan-t5-3b-summarizer/beam-search-8192-nb4/ASRnlp_law_lecture_week_3_part_1_v_2_c_transcription_3_summary.txt |
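The sample rows suggest that each GAUNTLET_PATH value is made of model directories, a generation-config directory, and a file name that matches the file_name column. The sketch below splits a path on that assumption. `split_gauntlet_path` and the key names are hypothetical, and the meaning of the config directory name (for example `bs-16384-nb-8`) is not decoded because the report does not define it.

```python
from pathlib import PurePosixPath


def split_gauntlet_path(path):
    """Split a GAUNTLET_PATH value into directory components and file name."""
    parts = PurePosixPath(path).parts
    # Assumption: the last component is the summary file, the one before it
    # names the generation config, and the rest identify the model.
    return {
        "model_dirs": list(parts[:-2]),
        "config_dir": parts[-2],
        "file_name": parts[-1],
    }


example = (
    "long-t5/long-t5-base-booksci-summary-v1/bs-16384-nb-8/"
    "script_strangersonatrain_summary.txt"
)
info = split_gauntlet_path(example)
print(info["model_dirs"])
print(info["config_dir"])
print(info["file_name"])
```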
Common Values
Value | Count | Frequency (%) |
SHORT-CONTEXT-MODELS/flan-t5-3b-summarizer/beam-search-8192-nb4/ASR-whisper-rpunctuated_Noam Chomsky, Fundam_1669853561_0_part1_summary.txt | 1 | < 0.1% |
long-t5/long-t5-base-booksci-summary-v1/bs-16384-nb-8/script_strangersonatrain_summary.txt | 1 | < 0.1% |
long-t5/long-t5-base-booksci-summary-v1/bs-8192-nb-16/OCR_PAPER_dall-e-2-annotated__summary.txt | 1 | < 0.1% |
long-t5/long-t5-base-booksci-summary-v1/bs-8192-nb-16/OCR_PAPER_Kandpal, Nieto, Jin - 2022 - Music Enhancement via Image Translation and Vocoding-annotated__summary.txt | 1 | < 0.1% |
long-t5/long-t5-base-booksci-summary-v1/bs-8192-nb-16/OCR_PAPER_Hong et al. - 2022 - CogVideo Large-scale Pretraining for Text-to-Video Generation via Transformers-annotated__summary.txt | 1 | < 0.1% |
long-t5/long-t5-base-booksci-summary-v1/bs-8192-nb-16/OCR_ML4HLecture05-NLP.pptx__summary.txt | 1 | < 0.1% |
long-t5/long-t5-base-booksci-summary-v1/bs-8192-nb-16/OCR_ML4HLecture04RepresentationLearning.pptx__summary.txt | 1 | < 0.1% |
long-t5/long-t5-base-booksci-summary-v1/bs-8192-nb-16/OCR_ML4HLecture02image__summary.txt | 1 | < 0.1% |
long-t5/long-t5-base-booksci-summary-v1/bs-8192-nb-16/Emie_dissertation_cleansed_summary.txt | 1 | < 0.1% |
long-t5/long-t5-base-booksci-summary-v1/bs-8192-nb-16/ASRnlp_law_lecture_week_3_part_1_v_2_c_transcription_3_summary.txt | 1 | < 0.1% |
Other values (2033) | 2033 |
Length
Value | Count | Frequency (%) |
(blank) | 476 | 7.9% |
2022 | 221 | 3.6% |
via | 221 | 3.6% |
chomsky | 212 | 3.5% |
cogvideo | 111 | 1.8% |
large-scale | 111 | 1.8% |
pretraining | 111 | 1.8% |
al | 111 | 1.8% |
et | 111 | 1.8% |
for | 111 | 1.8% |
Other values (1968) | 4267 |
Most occurring characters
Value | Count | Frequency (%) |
- | 17654 | 7.5% |
a | 15514 | 6.6% |
e | 14926 | 6.3% |
t | 14442 | 6.1% |
s | 12967 | 5.5% |
r | 11615 | 4.9% |
n | 10481 | 4.4% |
m | 10414 | 4.4% |
_ | 8955 | 3.8% |
o | 8528 | 3.6% |
Other values (51) | 111220 |
Most occurring categories
Value | Count | Frequency (%) |
Lowercase Letter | 160105 | 67.6% |
Uppercase Letter | 19122 | 8.1% |
Dash Punctuation | 17654 | 7.5% |
Decimal Number | 17207 | 7.3% |
Other Punctuation | 9653 | 4.1% |
Connector Punctuation | 8955 | 3.8% |
Space Separator | 4020 | 1.7% |
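The category names in the table above are Unicode general categories. A minimal sketch of how such counts can be produced with the standard library follows. `category_counts` and the name mapping are illustrative, and this is not necessarily how the profiling tool computes them.

```python
import unicodedata
from collections import Counter

# Readable names for the general-category codes that occur in the table above.
CATEGORY_NAMES = {
    "Ll": "Lowercase Letter",
    "Lu": "Uppercase Letter",
    "Pd": "Dash Punctuation",
    "Nd": "Decimal Number",
    "Po": "Other Punctuation",
    "Pc": "Connector Punctuation",
    "Zs": "Space Separator",
}


def category_counts(text):
    """Count characters per Unicode general category."""
    codes = Counter(unicodedata.category(ch) for ch in text)
    return {CATEGORY_NAMES.get(code, code): n for code, n in codes.items()}


sample = "long-t5/long-t5-base-booksci-summary-v1/bs-8192-nb-16/OCR_PAPER_dall-e-2-annotated__summary.txt"
for name, n in sorted(category_counts(sample).items(), key=lambda kv: -kv[1]):
    print(f"{name}: {n}")
```

Note that `/`, `.` and `,` all fall under Other Punctuation, while `-` and `_` have categories of their own.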
Most frequent character per category
Lowercase Letter
Value | Count | Frequency (%) |
a | 15514 | 9.7% |
e | 14926 | 9.3% |
t | 14442 | 9.0% |
s | 12967 | 8.1% |
r | 11615 | 7.3% |
n | 10481 | 6.5% |
m | 10414 | 6.5% |
o | 8528 | 5.3% |
u | 7178 | 4.5% |
l | 7117 | 4.4% |
Other values (14) | 46923 |
Uppercase Letter
Value | Count | Frequency (%) |
L | 2012 | 10.5% |
R | 1902 | 9.9% |
E | 1897 | 9.9% |
O | 1587 | 8.3% |
S | 1500 | 7.8% |
C | 1466 | 7.7% |
A | 1265 | 6.6% |
T | 1151 | 6.0% |
M | 1085 | 5.7% |
D | 1044 | 5.5% |
Other values (12) | 4213 |
Decimal Number
Value | Count | Frequency (%) |
5 | 2674 | 15.5% |
1 | 2491 | 14.5% |
2 | 2454 | 14.3% |
8 | 2090 | 12.1% |
4 | 2074 | 12.1% |
6 | 1797 | 10.4% |
3 | 1297 | 7.5% |
0 | 1236 | 7.2% |
9 | 1094 | 6.4% |
Other Punctuation
Value | Count | Frequency (%) |
/ | 6668 | 69.1% |
. | 2553 | 26.4% |
, | 432 | 4.5% |
Dash Punctuation
Value | Count | Frequency (%) |
- | 17654 | 100.0% |
Connector Punctuation
Value | Count | Frequency (%) |
_ | 8955 | 100.0% |
Space Separator
Value | Count | Frequency (%) |
(space) | 4020 | 100.0% |
Most occurring scripts
Value | Count | Frequency (%) |
Latin | 179227 | 75.7% |
Common | 57489 | 24.3% |
Most frequent character per script
Latin
Value | Count | Frequency (%) |
a | 15514 | 8.7% |
e | 14926 | 8.3% |
t | 14442 | 8.1% |
s | 12967 | 7.2% |
r | 11615 | 6.5% |
n | 10481 | 5.8% |
m | 10414 | 5.8% |
o | 8528 | 4.8% |
u | 7178 | 4.0% |
l | 7117 | 4.0% |
Other values (36) | 66045 |
Common
Value | Count | Frequency (%) |
- | 17654 | 30.7% |
_ | 8955 | 15.6% |
/ | 6668 | 11.6% |
(space) | 4020 | 7.0% |
5 | 2674 | 4.7% |
. | 2553 | 4.4% |
1 | 2491 | 4.3% |
2 | 2454 | 4.3% |
8 | 2090 | 3.6% |
4 | 2074 | 3.6% |
Other values (5) | 5856 | 10.2% |
Most occurring blocks
Value | Count | Frequency (%) |
ASCII | 236716 | 100.0% |
Most frequent character per block
ASCII
Value | Count | Frequency (%) |
- | 17654 | 7.5% |
a | 15514 | 6.6% |
e | 14926 | 6.3% |
t | 14442 | 6.1% |
s | 12967 | 5.5% |
r | 11615 | 4.9% |
n | 10481 | 4.4% |
m | 10414 | 4.4% |
_ | 8955 | 3.8% |
o | 8528 | 3.6% |
Other values (51) | 111220 |
file_name
Categorical
Distinct | 41 |
---|---|
Distinct (%) | 2.0% |
Missing | 0 |
Missing (%) | 0.0% |
Memory size | 224.1 KiB |
OCR_PAPER_Hong et al. - 2022 - CogVideo Large-scale Pretraining for Text-to-Video Generation via Transformers-annotated__summary.txt | 106 |
---|---|
OCR_ML4HLecture02image__summary.txt | 104 |
OCR_ML4HLecture04RepresentationLearning.pptx__summary.txt | 104 |
OCR_ML4HLecture05-NLP.pptx__summary.txt | 104 |
OCR_PAPER_Kandpal, Nieto, Jin - 2022 - Music Enhancement via Image Translation and Vocoding-annotated__summary.txt | 104 |
Other values (36) | 1521 |
Length
Max length | 132 |
---|---|
Median length | 66 |
Mean length | 55.266275 |
Min length | 26 |
Characters and Unicode
Total characters | 112909 |
---|---|
Distinct characters | 58 |
Distinct categories | 7 |
Distinct scripts | 2 |
Distinct blocks | 1 |
Unique
Unique | 0 |
---|---|
Unique (%) | 0.0% |
Sample
1st row | ASR-whisper-rpunctuated_Noam Chomsky, Fundam_1669853561_0_part1_summary.txt |
---|---|
2nd row | ASR-whisper-rpunctuated_Noam Chomsky, Fundam_1669853631_0_part2_summary.txt |
3rd row | ASRnlp_law_lecture_week_1_v_2_c_transcription_1_summary.txt |
4th row | ASRnlp_law_lecture_week_2_v_2_c_transcription_2_summary.txt |
5th row | ASRnlp_law_lecture_week_3_part_1_v_2_c_transcription_3_summary.txt |
Common Values
Value | Count | Frequency (%) |
OCR_PAPER_Hong et al. - 2022 - CogVideo Large-scale Pretraining for Text-to-Video Generation via Transformers-annotated__summary.txt | 106 | 5.2% |
OCR_ML4HLecture02image__summary.txt | 104 | 5.1% |
OCR_ML4HLecture04RepresentationLearning.pptx__summary.txt | 104 | 5.1% |
OCR_ML4HLecture05-NLP.pptx__summary.txt | 104 | 5.1% |
OCR_PAPER_Kandpal, Nieto, Jin - 2022 - Music Enhancement via Image Translation and Vocoding-annotated__summary.txt | 104 | 5.1% |
The Most Dangerous Game--Richard Connell_summary.txt | 104 | 5.1% |
script_sunsetblvd._summary.txt | 103 | 5.0% |
script_frozendisney_summary.txt | 103 | 5.0% |
gpt_peter_testing_group_exemplars_summary.txt | 103 | 5.0% |
script_findingnemo_summary.txt | 103 | 5.0% |
Other values (31) | 1005 |
Length
Value | Count | Frequency (%) |
(blank) | 476 | 7.9% |
via | 221 | 3.6% |
2022 | 221 | 3.6% |
chomsky | 212 | 3.5% |
asr-whisper-rpunctuated_noam | 200 | 3.3% |
cogvideo | 111 | 1.8% |
pretraining | 111 | 1.8% |
for | 111 | 1.8% |
text-to-video | 111 | 1.8% |
generation | 111 | 1.8% |
Other values (67) | 4178 |
Most occurring characters
Value | Count | Frequency (%) |
t | 9695 | 8.6% |
_ | 8697 | 7.7% |
a | 7680 | 6.8% |
e | 6981 | 6.2% |
r | 6687 | 5.9% |
n | 5899 | 5.2% |
m | 5414 | 4.8% |
s | 5370 | 4.8% |
(space) | 4020 | 3.6% |
i | 3935 | 3.5% |
Other values (48) | 48531 |
Most occurring categories
Value | Count | Frequency (%) |
Lowercase Letter | 79010 | 70.0% |
Uppercase Letter | 10604 | 9.4% |
Connector Punctuation | 8697 | 7.7% |
Decimal Number | 5585 | 4.9% |
Space Separator | 4020 | 3.6% |
Other Punctuation | 2909 | 2.6% |
Dash Punctuation | 2084 | 1.8% |
Most frequent character per category
Lowercase Letter
Value | Count | Frequency (%) |
t | 9695 | 12.3% |
a | 7680 | 9.7% |
e | 6981 | 8.8% |
r | 6687 | 8.5% |
n | 5899 | 7.5% |
m | 5414 | 6.9% |
s | 5370 | 6.8% |
i | 3935 | 5.0% |
u | 3676 | 4.7% |
o | 3494 | 4.4% |
Other values (14) | 20179 |
Uppercase Letter
Value | Count | Frequency (%) |
R | 1724 | 16.3% |
C | 1086 | 10.2% |
L | 985 | 9.3% |
P | 872 | 8.2% |
A | 852 | 8.0% |
O | 654 | 6.2% |
M | 647 | 6.1% |
S | 626 | 5.9% |
E | 544 | 5.1% |
T | 441 | 4.2% |
Other values (10) | 2173 |
Decimal Number
Value | Count | Frequency (%) |
2 | 1515 | 27.1% |
1 | 851 | 15.2% |
0 | 761 | 13.6% |
6 | 636 | 11.4% |
3 | 535 | 9.6% |
4 | 437 | 7.8% |
5 | 426 | 7.6% |
8 | 212 | 3.8% |
9 | 212 | 3.8% |
Other Punctuation
Value | Count | Frequency (%) |
. | 2477 | 85.1% |
, | 432 | 14.9% |
Connector Punctuation
Value | Count | Frequency (%) |
_ | 8697 | 100.0% |
Space Separator
Value | Count | Frequency (%) |
(space) | 4020 | 100.0% |
Dash Punctuation
Value | Count | Frequency (%) |
- | 2084 | 100.0% |
Most occurring scripts
Value | Count | Frequency (%) |
Latin | 89614 | 79.4% |
Common | 23295 | 20.6% |
Most frequent character per script
Latin
Value | Count | Frequency (%) |
t | 9695 | 10.8% |
a | 7680 | 8.6% |
e | 6981 | 7.8% |
r | 6687 | 7.5% |
n | 5899 | 6.6% |
m | 5414 | 6.0% |
s | 5370 | 6.0% |
i | 3935 | 4.4% |
u | 3676 | 4.1% |
o | 3494 | 3.9% |
Other values (34) | 30783 |
Common
Value | Count | Frequency (%) |
_ | 8697 | 37.3% |
(space) | 4020 | 17.3% |
. | 2477 | 10.6% |
- | 2084 | 8.9% |
2 | 1515 | 6.5% |
1 | 851 | 3.7% |
0 | 761 | 3.3% |
6 | 636 | 2.7% |
3 | 535 | 2.3% |
4 | 437 | 1.9% |
Other values (4) | 1282 | 5.5% |
Most occurring blocks
Value | Count | Frequency (%) |
ASCII | 112909 | 100.0% |
Most frequent character per block
ASCII
Value | Count | Frequency (%) |
t | 9695 | 8.6% |
_ | 8697 | 7.7% |
a | 7680 | 6.8% |
e | 6981 | 6.2% |
r | 6687 | 5.9% |
n | 5899 | 5.2% |
m | 5414 | 4.8% |
s | 5370 | 4.8% |
(space) | 4020 | 3.6% |
i | 3935 | 3.5% |
Other values (48) | 48531 |
summary
Categorical
HIGH CARDINALITY
UNIFORM
Distinct | 1960 |
---|---|
Distinct (%) | 95.9% |
Missing | 0 |
Missing (%) | 0.0% |
Memory size | 4.5 MiB |
<no_saic_raw_sp><sep_4><sep_4><sep_4><sep_4><sep_4><sep_4><sep_4> <no_saic_raw_sp><sep_4><sep_4><sep_4><sep_4><sep_4><sep_4><sep_4> | 6 |
---|---|
you're nothing to me, little bitch. | 4 |
I'm a sniper in the U. S. Navy Seals and I have been involved in many secret raids against Al-Qaeda. I've killed over 300 confirmed targets. | 4 |
we present a novel approach to enhance music signals by combining recent advances in conditional image-synthesis and voccoding. We find that our approach achieves an improved perception of music than many state-of the-art methods for audio enhancement. Additionally, we compare the subjective hearing test scores with commonly used audio quality measures and suggest that these metrics correlate well with human perception. | 3 |
The US Marine Corps has the most powerful weaponry in the world. It is possible to kill a person with ease. --- | 3 |
Other values (1955) | 2023 |
Length
Max length | 31506 |
---|---|
Median length | 2490 |
Mean length | 2182.3671 |
Min length | 15 |
Characters and Unicode
Total characters | 4458576 |
---|---|
Distinct characters | 122 |
Distinct categories | 17 |
Distinct scripts | 3 |
Distinct blocks | 5 |
Unique
Unique | 1894 |
---|---|
Unique (%) | 92.7% |
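Distinct and Unique measure different things for this column: 1960 distinct summaries are reported, of which 1894 are unique. The sketch below shows the distinction on toy data, assuming Distinct counts the different values present and Unique counts the values that occur exactly once. The toy list and the helper name are illustrative and not taken from the dataset.

```python
from collections import Counter


def distinct_and_unique(values):
    """Return (number of different values, number of values seen exactly once)."""
    counts = Counter(values)
    distinct = len(counts)
    unique = sum(1 for n in counts.values() if n == 1)
    return distinct, unique


toy = ["a", "b", "b", "c", "c", "c", "d"]
distinct, unique = distinct_and_unique(toy)
print(f"distinct={distinct} unique={unique}")
```

Under that reading, 1960 − 1894 = 66 summaries in the report occur more than once, and together they account for 2043 − 1894 = 149 rows.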
Sample
1st row | There's lots of interesting things to say about language, but I don't think it's as simple as you think. There's lots of interesting things about language, but if you really want to understand it, you need to look at the whole picture. --- |
---|---|
2nd row | There's no such thing as a simple language. There's more than one way to solve ATB. I think you're asking the wrong question. --- |
3rd row | if you don't want to read the whole thing, just skip it. I'm sorry for the wall of text. --- |
4th row | I think it's okay to ask questions about what you want to do in the course. I'm a bit of a nerd. --- |
5th row | I'm not sure if this is the right subreddit to post this. I'm going to finish it in person. --- |
Common Values
Value | Count | Frequency (%) |
<no_saic_raw_sp><sep_4><sep_4><sep_4><sep_4><sep_4><sep_4><sep_4> <no_saic_raw_sp><sep_4><sep_4><sep_4><sep_4><sep_4><sep_4><sep_4> | 6 | 0.3% |
you're nothing to me, little bitch. | 4 | 0.2% |
I'm a sniper in the U. S. Navy Seals and I have been involved in many secret raids against Al-Qaeda. I've killed over 300 confirmed targets. | 4 | 0.2% |
we present a novel approach to enhance music signals by combining recent advances in conditional image-synthesis and voccoding. We find that our approach achieves an improved perception of music than many state-of the-art methods for audio enhancement. Additionally, we compare the subjective hearing test scores with commonly used audio quality measures and suggest that these metrics correlate well with human perception. | 3 | 0.1% |
The US Marine Corps has the most powerful weaponry in the world. It is possible to kill a person with ease. --- | 3 | 0.1% |
The narrator tells the audience that he's been training as a sniper for the U.S. Navy and has killed hundreds of enemy soldiers. He promises to "wipe you the dead kiddo" out of your mouth. | 3 | 0.1% |
The storm that wipes you out with precision, mark my words. what the f*ck did you just say about me? I can go anywhere, anytime and I'll kill you in seven hundred ways and that doesn't even bother to count the number of ways I can destroy you. you're freaking dead, kid." if only you knew what "clever," comment was going to bring down on you, kid. but you don't, so now you've paying the price. you are fucking dead, child. i will hit you all over you; you will drown your sorrow in it. this is a vicious, bloodthirsty, animalistic monster. he will have you know that by the time this chapter ends, you'll have paid the price for saying these things to him over the internet. you won't get away with it. you killed the little punk.You're gonna die, kid - you're... dead, dad.If only you were able to imagine how awful this whole thing is going to be and how much damage it's going to do to you, then you wouldn't be such a moron. you'd be totally screwed. | 3 | 0.1% |
A person threatens someone who insulted them online, claiming to be a highly trained Navy SEAL with access to a network of spies and the entire arsenal of the US Marine Corps. They vow to kill the person in over 700 ways and make them suffer for their comment. | 3 | 0.1% |
The sniper tells the kid that he's going to kill him in seven hundred ways. --- | 3 | 0.1% |
A strategy to integrate a deep learning system into the clinical workflow. --- | 3 | 0.1% |
Other values (1950) | 2008 |
Length
Value | Count | Frequency (%) |
the | 48901 | 6.3% |
to | 29387 | 3.8% |
of | 22874 | 2.9% |
and | 20684 | 2.7% |
a | 20492 | 2.6% |
in | 12531 | 1.6% |
is | 11649 | 1.5% |
he | 10784 | 1.4% |
that | 10572 | 1.4% |
for | 6715 | 0.9% |
Other values (23567) | 583201 |
Most occurring characters
Value | Count | Frequency (%) |
(space) | 771616 | 17.3% |
e | 442639 | 9.9% |
t | 324077 | 7.3% |
a | 274843 | 6.2% |
o | 264412 | 5.9% |
n | 253954 | 5.7% |
s | 245698 | 5.5% |
i | 241682 | 5.4% |
r | 204655 | 4.6% |
h | 197047 | 4.4% |
Other values (112) | 1237953 |
Most occurring categories
Value | Count | Frequency (%) |
Lowercase Letter | 3423698 | 76.8% |
Space Separator | 771710 | 17.3% |
Uppercase Letter | 115789 | 2.6% |
Other Punctuation | 105255 | 2.4% |
Decimal Number | 13305 | 0.3% |
Control | 11648 | 0.3% |
Dash Punctuation | 8490 | 0.2% |
Close Punctuation | 2571 | 0.1% |
Open Punctuation | 2259 | 0.1% |
Math Symbol | 1984 | < 0.1% |
Other values (7) | 1867 | < 0.1% |
Most frequent character per category
Lowercase Letter
Value | Count | Frequency (%) |
e | 442639 | 12.9% |
t | 324077 | 9.5% |
a | 274843 | 8.0% |
o | 264412 | 7.7% |
n | 253954 | 7.4% |
s | 245698 | 7.2% |
i | 241682 | 7.1% |
r | 204655 | 6.0% |
h | 197047 | 5.8% |
l | 134361 | 3.9% |
Other values (23) | 840330 |
Uppercase Letter
Value | Count | Frequency (%) |
T | 15733 | 13.6% |
A | 11976 | 10.3% |
I | 8152 | 7.0% |
S | 7254 | 6.3% |
G | 6838 | 5.9% |
H | 6701 | 5.8% |
B | 6058 | 5.2% |
N | 5593 | 4.8% |
M | 5238 | 4.5% |
C | 5187 | 4.5% |
Other values (18) | 37059 |
Other Punctuation
Value | Count | Frequency (%) |
. | 45900 | 43.6% |
, | 34895 | 33.2% |
' | 11580 | 11.0% |
" | 6975 | 6.6% |
: | 1807 | 1.7% |
; | 1198 | 1.1% |
# | 979 | 0.9% |
? | 822 | 0.8% |
/ | 460 | 0.4% |
! | 333 | 0.3% |
Other values (5) | 306 | 0.3% |
Decimal Number
Value | Count | Frequency (%) |
1 | 2593 | 19.5% |
0 | 2106 | 15.8% |
2 | 1936 | 14.6% |
4 | 1641 | 12.3% |
3 | 1264 | 9.5% |
9 | 1033 | 7.8% |
5 | 990 | 7.4% |
6 | 730 | 5.5% |
8 | 537 | 4.0% |
7 | 475 | 3.6% |
Control
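Value | Count | Frequency (%) |
(control character) | 6832 | 58.7% |
(control character) | 4806 | 41.3% |
(control character) | 5 | < 0.1% |
(control character) | 2 | < 0.1% |
(control character) | 1 | < 0.1% |
(control character) | 1 | < 0.1% |
(control character) | 1 | < 0.1% |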
Math Symbol
Value | Count | Frequency (%) |
> | 867 | 43.7% |
< | 860 | 43.3% |
= | 156 | 7.9% |
+ | 82 | 4.1% |
| | 12 | 0.6% |
~ | 7 | 0.4% |
Dash Punctuation
Value | Count | Frequency (%) |
- | 8486 | |
– | 3 | < 0.1% |
— | 1 | < 0.1% |
Close Punctuation
Value | Count | Frequency (%) |
) | 1565 | 60.9% |
] | 1000 | 38.9% |
} | 6 | 0.2% |
Open Punctuation
Value | Count | Frequency (%) |
( | 1267 | 56.1% |
[ | 986 | 43.6% |
{ | 6 | 0.3% |
Modifier Symbol
Value | Count | Frequency (%) |
` | 4 | 66.7% |
´ | 1 | 16.7% |
^ | 1 | 16.7% |
Space Separator
Value | Count | Frequency (%) |
(space) | 771616 | > 99.9% |
(non-ASCII space) | 94 | < 0.1% |
Final Punctuation
Value | Count | Frequency (%) |
’ | 35 | 63.6% |
” | 20 | 36.4% |
Initial Punctuation
Value | Count | Frequency (%) |
“ | 19 | 79.2% |
‘ | 5 | 20.8% |
Other Symbol
Value | Count | Frequency (%) |
� | 5 | 83.3% |
¦ | 1 | 16.7% |
Connector Punctuation
Value | Count | Frequency (%) |
_ | 1748 | 100.0% |
Currency Symbol
Value | Count | Frequency (%) |
$ | 27 | 100.0% |
Other Letter
Value | Count | Frequency (%) |
ਤ | 1 | 100.0% |
Most occurring scripts
Value | Count | Frequency (%) |
Latin | 3539487 | 79.4% |
Common | 919088 | 20.6% |
Gurmukhi | 1 | < 0.1% |
Most frequent character per script
Latin
Value | Count | Frequency (%) |
e | 442639 | 12.5% |
t | 324077 | 9.2% |
a | 274843 | 7.8% |
o | 264412 | 7.5% |
n | 253954 | 7.2% |
s | 245698 | 6.9% |
i | 241682 | 6.8% |
r | 204655 | 5.8% |
h | 197047 | 5.6% |
l | 134361 | 3.8% |
Other values (51) | 956119 |
Common
Value | Count | Frequency (%) |
(space) | 771616 | 84.0% |
. | 45900 | 5.0% |
, | 34895 | 3.8% |
' | 11580 | 1.3% |
- | 8486 | 0.9% |
" | 6975 | 0.8% |
(control character) | 6832 | 0.7% |
(control character) | 4806 | 0.5% |
1 | 2593 | 0.3% |
0 | 2106 | 0.2% |
Other values (50) | 23299 | 2.5% |
Gurmukhi
Value | Count | Frequency (%) |
ਤ | 1 |
Most occurring blocks
Value | Count | Frequency (%) |
ASCII | 4458299 | > 99.9% |
None | 188 | < 0.1% |
Punctuation | 83 | < 0.1% |
Specials | 5 | < 0.1% |
Gurmukhi | 1 | < 0.1% |
Most frequent character per block
ASCII
Value | Count | Frequency (%) |
(space) | 771616 | 17.3% |
e | 442639 | 9.9% |
t | 324077 | 7.3% |
a | 274843 | 6.2% |
o | 264412 | 5.9% |
n | 253954 | 5.7% |
s | 245698 | 5.5% |
i | 241682 | 5.4% |
r | 204655 | 4.6% |
h | 197047 | 4.4% |
Other values (88) | 1237676 |
None
Value | Count | Frequency (%) |
(non-ASCII space) | 94 | 50.0% |
 | 41 | 21.8% |
â | 21 | 11.2% |
(control character) | 5 | 2.7% |
ö | 5 | 2.7% |
é | 4 | 2.1% |
è | 3 | 1.6% |
ü | 3 | 1.6% |
à | 3 | 1.6% |
(control character) | 2 | 1.1% |
Other values (6) | 7 | 3.7% |
Punctuation
Value | Count | Frequency (%) |
’ | 35 | 42.2% |
” | 20 | 24.1% |
“ | 19 | 22.9% |
‘ | 5 | 6.0% |
– | 3 | 3.6% |
— | 1 | 1.2% |
Specials
Value | Count | Frequency (%) |
� | 5 |
Gurmukhi
Value | Count | Frequency (%) |
ਤ | 1 |
min_length
Categorical
CONSTANT
MISSING
Distinct | 1 |
---|---|
Distinct (%) | 0.1% |
Missing | 324 |
Missing (%) | 15.9% |
Memory size | 113.5 KiB |
8.0 |
---|
Length
Max length | 3 |
---|---|
Median length | 3 |
Mean length | 3 |
Min length | 3 |
Characters and Unicode
Total characters | 5157 |
---|---|
Distinct characters | 3 |
Distinct categories | 2 |
Distinct scripts | 1 |
Distinct blocks | 1 |
Unique
Unique | 0 |
---|---|
Unique (%) | 0.0% |
Sample
1st row | 8.0 |
---|---|
2nd row | 8.0 |
3rd row | 8.0 |
4th row | 8.0 |
5th row | 8.0 |
Common Values
Value | Count | Frequency (%) |
8.0 | 1719 | 84.1% |
(Missing) | 324 | 15.9% |
Length
Common Values (Plot)
Value | Count | Frequency (%) |
8.0 | 1719 |
Most occurring characters
Value | Count | Frequency (%) |
8 | 1719 | |
. | 1719 | |
0 | 1719 |
Most occurring categories
Value | Count | Frequency (%) |
Decimal Number | 3438 | |
Other Punctuation | 1719 |
Most frequent character per category
Decimal Number
Value | Count | Frequency (%) |
8 | 1719 | |
0 | 1719 |
Other Punctuation
Value | Count | Frequency (%) |
. | 1719 |
Most occurring scripts
Value | Count | Frequency (%) |
Common | 5157 |
Most frequent character per script
Common
Value | Count | Frequency (%) |
8 | 1719 | |
. | 1719 | |
0 | 1719 |
Most occurring blocks
Value | Count | Frequency (%) |
ASCII | 5157 |
Most frequent character per block
ASCII
Value | Count | Frequency (%) |
8 | 1719 | |
. | 1719 | |
0 | 1719 |
max_length
Real number (ℝ)
Distinct | 10 |
---|---|
Distinct (%) | 0.5% |
Missing | 81 |
Missing (%) | 4.0% |
Infinite | 0 |
Infinite (%) | 0.0% |
Mean | 2217.4516 |
Minimum | 128 |
---|---|
Maximum | 4096 |
Zeros | 0 |
Zeros (%) | 0.0% |
Negative | 0 |
Negative (%) | 0.0% |
Memory size | 16.1 KiB |
Quantile statistics
Minimum | 128 |
---|---|
5-th percentile | 256 |
Q1 | 1024 |
median | 2048 |
Q3 | 4096 |
95-th percentile | 4096 |
Maximum | 4096 |
Range | 3968 |
Interquartile range (IQR) | 3072 |
Descriptive statistics
Standard deviation | 1401.9547 |
---|---|
Coefficient of variation (CV) | 0.6322369 |
Kurtosis | -1.4175937 |
Mean | 2217.4516 |
Median Absolute Deviation (MAD) | 1056 |
Skewness | 0.28000517 |
Sum | 4350640 |
Variance | 1965477 |
Monotonicity | Not monotonic |
Common Values
Value | Count | Frequency (%) |
4096 | 613 | 30.0% |
2048 | 512 | 25.1% |
1024 | 414 | 20.3% |
256 | 133 | 6.5% |
512 | 114 | 5.6% |
992 | 76 | 3.7% |
3276 | 34 | 1.7% |
1927 | 30 | 1.5% |
128 | 19 | 0.9% |
1638 | 17 | 0.8% |
(Missing) | 81 | 4.0% |
Minimum 10 values
Value | Count | Frequency (%) |
128 | 19 | 0.9% |
256 | 133 | 6.5% |
512 | 114 | 5.6% |
992 | 76 | 3.7% |
1024 | 414 | 20.3% |
1638 | 17 | 0.8% |
1927 | 30 | 1.5% |
2048 | 512 | 25.1% |
3276 | 34 | 1.7% |
4096 | 613 | 30.0% |
Maximum 10 values
Value | Count | Frequency (%) |
4096 | 613 | 30.0% |
3276 | 34 | 1.7% |
2048 | 512 | 25.1% |
1927 | 30 | 1.5% |
1638 | 17 | 0.8% |
1024 | 414 | 20.3% |
992 | 76 | 3.7% |
512 | 114 | 5.6% |
256 | 133 | 6.5% |
128 | 19 | 0.9% |
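The summary statistics for max_length can be rebuilt from its value counts, which is a useful consistency check on the report. A minimal sketch using only the standard library. It assumes the reported standard deviation is the sample (n − 1) version, and the variable names are illustrative.

```python
import statistics

# Value counts copied from the max_length frequency table (missing rows excluded).
value_counts = {
    4096: 613, 2048: 512, 1024: 414, 256: 133, 512: 114,
    992: 76, 3276: 34, 1927: 30, 128: 19, 1638: 17,
}

# Expand the counts back into one value per non-missing observation.
values = [v for v, n in value_counts.items() for _ in range(n)]

n = len(values)
total = sum(values)
mean = total / n
median = statistics.median(values)
# Assumption: the reported standard deviation uses the n - 1 denominator.
std = statistics.stdev(values)

print(f"count={n} sum={total} mean={mean:.4f} median={median} std={std:.4f}")
```

The count of 1962 also matches 2043 observations minus the 81 missing values reported for this variable.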
no_repeat_ngram_size
Categorical
IMBALANCE
MISSING
Distinct | 2 |
---|---|
Distinct (%) | 0.1% |
Missing | 305 |
Missing (%) | 14.9% |
Memory size | 113.9 KiB |
3.0 | 1638 |
---|---|
4.0 | 100 |
Length
Max length | 3 |
---|---|
Median length | 3 |
Mean length | 3 |
Min length | 3 |
Characters and Unicode
Total characters | 5214 |
---|---|
Distinct characters | 4 |
Distinct categories | 2 |
Distinct scripts | 1 |
Distinct blocks | 1 |
Unique
Unique | 0 |
---|---|
Unique (%) | 0.0% |
Sample
1st row | 3.0 |
---|---|
2nd row | 3.0 |
3rd row | 3.0 |
4th row | 3.0 |
5th row | 3.0 |
Common Values
Value | Count | Frequency (%) |
3.0 | 1638 | 80.2% |
4.0 | 100 | 4.9% |
(Missing) | 305 | 14.9% |
Length
Common Values (Plot)
Value | Count | Frequency (%) |
3.0 | 1638 | 94.2% |
4.0 | 100 | 5.8% |
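The same category appears with two different percentages: 4.0 is listed at 4.9% under Common Values and at 5.8% under Common Values (Plot). The figures are consistent with the first table dividing by all observations and the second by non-missing observations only. The sketch below follows that reading, which is inferred from the numbers rather than stated in the report.

```python
n_observations = 2043  # from "Dataset statistics"
n_missing = 305        # missing values for no_repeat_ngram_size
count_4_0 = 100        # rows with value 4.0

# Assumed denominators: all rows vs. rows with a non-missing value.
share_of_all = 100 * count_4_0 / n_observations
share_of_non_missing = 100 * count_4_0 / (n_observations - n_missing)

print(f"{share_of_all:.1f}% of all rows")
print(f"{share_of_non_missing:.1f}% of non-missing rows")
```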
Most occurring characters
Value | Count | Frequency (%) |
. | 1738 | |
0 | 1738 | |
3 | 1638 | |
4 | 100 | 1.9% |
Most occurring categories
Value | Count | Frequency (%) |
Decimal Number | 3476 | |
Other Punctuation | 1738 |
Most frequent character per category
Decimal Number
Value | Count | Frequency (%) |
0 | 1738 | |
3 | 1638 | |
4 | 100 | 2.9% |
Other Punctuation
Value | Count | Frequency (%) |
. | 1738 |
Most occurring scripts
Value | Count | Frequency (%) |
Common | 5214 |
Most frequent character per script
Common
Value | Count | Frequency (%) |
. | 1738 | |
0 | 1738 | |
3 | 1638 | |
4 | 100 | 1.9% |
Most occurring blocks
Value | Count | Frequency (%) |
ASCII | 5214 |
Most frequent character per block
ASCII
Value | Count | Frequency (%) |
. | 1738 | |
0 | 1738 | |
3 | 1638 | |
4 | 100 | 1.9% |
encoder_no_repeat_ngram_size
Categorical
CONSTANT
MISSING
Distinct | 1 |
---|---|
Distinct (%) | 0.1% |
Missing | 305 |
Missing (%) | 14.9% |
Memory size | 113.9 KiB |
4.0 |
---|
Length
Max length | 3 |
---|---|
Median length | 3 |
Mean length | 3 |
Min length | 3 |
Characters and Unicode
Total characters | 5214 |
---|---|
Distinct characters | 3 |
Distinct categories | 2 |
Distinct scripts | 1 |
Distinct blocks | 1 |
Unique
Unique | 0 |
---|---|
Unique (%) | 0.0% |
Sample
1st row | 4.0 |
---|---|
2nd row | 4.0 |
3rd row | 4.0 |
4th row | 4.0 |
5th row | 4.0 |
Common Values
Value | Count | Frequency (%) |
4.0 | 1738 | 85.1% |
(Missing) | 305 | 14.9% |
Length
Common Values (Plot)
Value | Count | Frequency (%) |
4.0 | 1738 |
Most occurring characters
Value | Count | Frequency (%) |
4 | 1738 | |
. | 1738 | |
0 | 1738 |
Most occurring categories
Value | Count | Frequency (%) |
Decimal Number | 3476 | |
Other Punctuation | 1738 |
Most frequent character per category
Decimal Number
Value | Count | Frequency (%) |
4 | 1738 | |
0 | 1738 |
Other Punctuation
Value | Count | Frequency (%) |
. | 1738 |
Most occurring scripts
Value | Count | Frequency (%) |
Common | 5214 |
Most frequent character per script
Common
Value | Count | Frequency (%) |
4 | 1738 | |
. | 1738 | |
0 | 1738 |
Most occurring blocks
Value | Count | Frequency (%) |
ASCII | 5214 |
Most frequent character per block
ASCII
Value | Count | Frequency (%) |
4 | 1738 | |
. | 1738 | |
0 | 1738 |
repetition_penalty
Categorical
IMBALANCE
MISSING
Distinct | 2 |
---|---|
Distinct (%) | 0.1% |
Missing | 324 |
Missing (%) | 15.9% |
Memory size | 113.5 KiB |
2.5 | 1668 |
---|---|
1.5 | 51 |
Length
Max length | 3 |
---|---|
Median length | 3 |
Mean length | 3 |
Min length | 3 |
Characters and Unicode
Total characters | 5157 |
---|---|
Distinct characters | 4 |
Distinct categories | 2 |
Distinct scripts | 1 |
Distinct blocks | 1 |
Unique
Unique | 0 |
---|---|
Unique (%) | 0.0% |
Sample
1st row | 2.5 |
---|---|
2nd row | 2.5 |
3rd row | 2.5 |
4th row | 2.5 |
5th row | 2.5 |
Common Values
Value | Count | Frequency (%) |
2.5 | 1668 | 81.6% |
1.5 | 51 | 2.5% |
(Missing) | 324 | 15.9% |
Length
Common Values (Plot)
Value | Count | Frequency (%) |
2.5 | 1668 | 97.0% |
1.5 | 51 | 3.0% |
Most occurring characters
Value | Count | Frequency (%) |
. | 1719 | |
5 | 1719 | |
2 | 1668 | |
1 | 51 | 1.0% |
Most occurring categories
Value | Count | Frequency (%) |
Decimal Number | 3438 | |
Other Punctuation | 1719 |
Most frequent character per category
Decimal Number
Value | Count | Frequency (%) |
5 | 1719 | |
2 | 1668 | |
1 | 51 | 1.5% |
Other Punctuation
Value | Count | Frequency (%) |
. | 1719 |
Most occurring scripts
Value | Count | Frequency (%) |
Common | 5157 |
Most frequent character per script
Common
Value | Count | Frequency (%) |
. | 1719 | |
5 | 1719 | |
2 | 1668 | |
1 | 51 | 1.0% |
Most occurring blocks
Value | Count | Frequency (%) |
ASCII | 5157 |
Most frequent character per block
ASCII
Value | Count | Frequency (%) |
. | 1719 | |
5 | 1719 | |
2 | 1668 | |
1 | 51 | 1.0% |
num_beams
Real number (ℝ)
Distinct | 10 |
---|---|
Distinct (%) | 0.6% |
Missing | 233 |
Missing (%) | 11.4% |
Infinite | 0 |
Infinite (%) | 0.0% |
Mean | 7.6232044 |
Minimum | 1 |
---|---|
Maximum | 32 |
Zeros | 0 |
Zeros (%) | 0.0% |
Negative | 0 |
Negative (%) | 0.0% |
Memory size | 16.1 KiB |
Quantile statistics
Minimum | 1 |
---|---|
5-th percentile | 2 |
Q1 | 4 |
median | 8 |
Q3 | 8 |
95-th percentile | 16 |
Maximum | 32 |
Range | 31 |
Interquartile range (IQR) | 4 |
Descriptive statistics
Standard deviation | 4.8640458 |
---|---|
Coefficient of variation (CV) | 0.6380579 |
Kurtosis | 5.3809618 |
Mean | 7.6232044 |
Median Absolute Deviation (MAD) | 4 |
Skewness | 1.7613591 |
Sum | 13798 |
Variance | 23.658942 |
Monotonicity | Not monotonic |
Common Values
Value | Count | Frequency (%) |
8 | 755 | 37.0% |
4 | 499 | 24.4% |
16 | 190 | 9.3% |
12 | 95 | 4.7% |
2 | 95 | 4.7% |
1 | 81 | 4.0% |
6 | 38 | 1.9% |
20 | 19 | 0.9% |
5 | 19 | 0.9% |
32 | 19 | 0.9% |
(Missing) | 233 | 11.4% |
Minimum 10 values
Value | Count | Frequency (%) |
1 | 81 | 4.0% |
2 | 95 | 4.7% |
4 | 499 | 24.4% |
5 | 19 | 0.9% |
6 | 38 | 1.9% |
8 | 755 | 37.0% |
12 | 95 | 4.7% |
16 | 190 | 9.3% |
20 | 19 | 0.9% |
32 | 19 | 0.9% |
Maximum 10 values
Value | Count | Frequency (%) |
32 | 19 | 0.9% |
20 | 19 | 0.9% |
16 | 190 | 9.3% |
12 | 95 | 4.7% |
8 | 755 | 37.0% |
6 | 38 | 1.9% |
5 | 19 | 0.9% |
4 | 499 | 24.4% |
2 | 95 | 4.7% |
1 | 81 | 4.0% |
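The summary statistics reported for num_beams can be cross-checked against its frequency table. The sketch below rebuilds the sum, mean, variance, standard deviation and coefficient of variation from the value counts alone. It assumes the report uses the sample variance (dividing by n - 1), which is the convention that reproduces the reported 23.658942.

```python
import math

# Frequency table for num_beams, copied from the report (value -> count).
value_counts = {8: 755, 4: 499, 16: 190, 12: 95, 2: 95,
                1: 81, 6: 38, 20: 19, 5: 19, 32: 19}

n = sum(value_counts.values())                       # non-missing observations
total = sum(v * c for v, c in value_counts.items())  # the report's "Sum"
mean = total / n

# Sample variance (divide by n - 1). This divisor is an assumption that
# matches the reported value; dividing by n gives a slightly smaller number.
sq_dev = sum(c * (v - mean) ** 2 for v, c in value_counts.items())
variance = sq_dev / (n - 1)
std = math.sqrt(variance)
cv = std / mean  # coefficient of variation

print(n, total, round(mean, 4), round(variance, 4), round(std, 4), round(cv, 4))
```

The count of 1810 is the 2043 observations minus the 233 missing cells.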
num_beam_groups
Categorical
CONSTANT
MISSING
Distinct | 1 |
---|---|
Distinct (%) | 0.1% |
Missing | 324 |
Missing (%) | 15.9% |
Memory size | 113.5 KiB |
1.0 |
---|
Length
Max length | 3 |
---|---|
Median length | 3 |
Mean length | 3 |
Min length | 3 |
Characters and Unicode
Total characters | 5157 |
---|---|
Distinct characters | 3 |
Distinct categories | 2 |
Distinct scripts | 1 |
Distinct blocks | 1 |
Unique
Unique | 0 |
---|---|
Unique (%) | 0.0% |
Sample
1st row | 1.0 |
---|---|
2nd row | 1.0 |
3rd row | 1.0 |
4th row | 1.0 |
5th row | 1.0 |
Common Values
Value | Count | Frequency (%) |
1.0 | 1719 | 84.1% |
(Missing) | 324 | 15.9% |
Length
Common Values (Plot)
Value | Count | Frequency (%) |
1.0 | 1719 |
Most occurring characters
Value | Count | Frequency (%) |
1 | 1719 | |
. | 1719 | |
0 | 1719 |
Most occurring categories
Value | Count | Frequency (%) |
Decimal Number | 3438 | |
Other Punctuation | 1719 |
Most frequent character per category
Decimal Number
Value | Count | Frequency (%) |
1 | 1719 | |
0 | 1719 |
Other Punctuation
Value | Count | Frequency (%) |
. | 1719 |
Most occurring scripts
Value | Count | Frequency (%) |
Common | 5157 |
Most frequent character per script
Common
Value | Count | Frequency (%) |
1 | 1719 | |
. | 1719 | |
0 | 1719 |
Most occurring blocks
Value | Count | Frequency (%) |
ASCII | 5157 |
Most frequent character per block
ASCII
Value | Count | Frequency (%) |
1 | 1719 | |
. | 1719 | |
0 | 1719 |
length_penalty
Categorical
IMBALANCE
MISSING
Distinct | 2 |
---|---|
Distinct (%) | 0.1% |
Missing | 354 |
Missing (%) | 17.3% |
Memory size | 113.0 KiB |
0.8 | |
---|---|
0.75 | 34 |
Length
Max length | 4 |
---|---|
Median length | 3 |
Mean length | 3.0201303 |
Min length | 3 |
Characters and Unicode
Total characters | 5101 |
---|---|
Distinct characters | 5 |
Distinct categories | 2 |
Distinct scripts | 1 |
Distinct blocks | 1 |
Unique
Unique | 0 |
---|---|
Unique (%) | 0.0% |
Sample
1st row | 0.8 |
---|---|
2nd row | 0.8 |
3rd row | 0.8 |
4th row | 0.8 |
5th row | 0.8 |
Common Values
Value | Count | Frequency (%) |
0.8 | 1655 | 81.0% |
0.75 | 34 | 1.7% |
(Missing) | 354 | 17.3% |
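The IMBALANCE tag on length_penalty corresponds to the alert "length_penalty is highly imbalanced (85.8%)". One common way to score imbalance is one minus the Shannon entropy of the class distribution, normalized by the maximum possible entropy. The sketch below assumes that definition, which the report itself does not state; the function name `imbalance_score` is illustrative only.

```python
import math

# Non-missing value counts for length_penalty, copied from the table above.
counts = [1655, 34]  # 0.8 and 0.75


def imbalance_score(counts):
    """One minus normalized Shannon entropy: 0 = balanced, 1 = a single class.

    Assumed definition, not confirmed by the report.
    """
    if len(counts) < 2:
        return 1.0  # a constant column is maximally imbalanced
    total = sum(counts)
    probs = [c / total for c in counts if c > 0]
    entropy = -sum(p * math.log2(p) for p in probs)
    return 1 - entropy / math.log2(len(counts))


score = imbalance_score(counts)
print(f"{score:.1%}")
```

With these counts the score rounds to the 85.8% quoted in the alert, which suggests the assumed definition is at least consistent with the report.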
Length
Common Values (Plot)
Value | Count | Frequency (%) |
0.8 | 1655 | |
0.75 | 34 | 2.0% |
Most occurring characters
Value | Count | Frequency (%) |
0 | 1689 | |
. | 1689 | |
8 | 1655 | |
7 | 34 | 0.7% |
5 | 34 | 0.7% |
Most occurring categories
Value | Count | Frequency (%) |
Decimal Number | 3412 | |
Other Punctuation | 1689 |
Most frequent character per category
Decimal Number
Value | Count | Frequency (%) |
0 | 1689 | |
8 | 1655 | |
7 | 34 | 1.0% |
5 | 34 | 1.0% |
Other Punctuation
Value | Count | Frequency (%) |
. | 1689 |
Most occurring scripts
Value | Count | Frequency (%) |
Common | 5101 |
Most frequent character per script
Common
Value | Count | Frequency (%) |
0 | 1689 | |
. | 1689 | |
8 | 1655 | |
7 | 34 | 0.7% |
5 | 34 | 0.7% |
Most occurring blocks
Value | Count | Frequency (%) |
ASCII | 5101 |
Most frequent character per block
ASCII
Value | Count | Frequency (%) |
0 | 1689 | |
. | 1689 | |
8 | 1655 | |
7 | 34 | 0.7% |
5 | 34 | 0.7% |
early_stopping
Boolean
CONSTANT
MISSING
Distinct | 1 |
---|---|
Distinct (%) | 0.1% |
Missing | 314 |
Missing (%) | 15.4% |
Memory size | 70.7 KiB |
True | |
---|---|
(Missing) |
Value | Count | Frequency (%) |
True | 1729 | 84.6% |
(Missing) | 314 | 15.4% |
do_sample
Boolean
CONSTANT
MISSING
Distinct | 1 |
---|---|
Distinct (%) | 0.1% |
Missing | 324 |
Missing (%) | 15.9% |
Memory size | 64.0 KiB |
False | |
---|---|
(Missing) |
Value | Count | Frequency (%) |
False | 1719 | 84.1% |
(Missing) | 324 | 15.9% |
model_name
Categorical
Distinct | 43 |
---|---|
Distinct (%) | 2.1% |
Missing | 0 |
Missing (%) | 0.0% |
Memory size | 189.8 KiB |
pszemraj/long-t5-tglobal-base-16384-book-summary | 186 |
---|---|
pszemraj/long-t5-tglobal-xl-16384-book-summary | 142 |
stacked-summaries/flan-t5-large-tinystack-booksum-1024-WIP1r2 | 95 |
stacked-summaries/flan-t5-large-stacked-samsum-1024 | 95 |
pszemraj/long-t5-tglobal-base-sci-simplify | 95 |
Other values (38) | 1430 |
Length
Max length | 72 |
---|---|
Median length | 48 |
Mean length | 38.047479 |
Min length | 5 |
Characters and Unicode
Total characters | 77731 |
---|---|
Distinct characters | 54 |
Distinct categories | 6 |
Distinct scripts | 2 |
Distinct blocks | 1 |
Unique
Unique | 0 |
---|---|
Unique (%) | 0.0% |
Sample
1st row | jordiclive/flan-t5-3b-summarizer |
---|---|
2nd row | jordiclive/flan-t5-3b-summarizer |
3rd row | jordiclive/flan-t5-3b-summarizer |
4th row | jordiclive/flan-t5-3b-summarizer |
5th row | jordiclive/flan-t5-3b-summarizer |
Common Values
Value | Count | Frequency (%) |
pszemraj/long-t5-tglobal-base-16384-book-summary | 186 | 9.1% |
pszemraj/long-t5-tglobal-xl-16384-book-summary | 142 | 7.0% |
stacked-summaries/flan-t5-large-tinystack-booksum-1024-WIP1r2 | 95 | 4.7% |
stacked-summaries/flan-t5-large-stacked-samsum-1024 | 95 | 4.7% |
pszemraj/long-t5-tglobal-base-sci-simplify | 95 | 4.7% |
AleBurzio/long-t5-base-govreport | 95 | 4.7% |
pszemraj/led-base-book-summary | 95 | 4.7% |
Joemgu/pegasus-x-sumstew | 95 | 4.7% |
gpt-3.5-turbo | 76 | 3.7% |
gpt-4 | 76 | 3.7% |
Other values (33) | 993 | 48.6% |
Length
Value | Count | Frequency (%) |
pszemraj/long-t5-tglobal-base-16384-book-summary | 186 | 9.1% |
pszemraj/long-t5-tglobal-xl-16384-book-summary | 142 | 7.0% |
stacked-summaries/flan-t5-large-tinystack-booksum-1024-wip1r2 | 95 | 4.7% |
stacked-summaries/flan-t5-large-stacked-samsum-1024 | 95 | 4.7% |
pszemraj/long-t5-tglobal-base-sci-simplify | 95 | 4.7% |
aleburzio/long-t5-base-govreport | 95 | 4.7% |
pszemraj/led-base-book-summary | 95 | 4.7% |
joemgu/pegasus-x-sumstew | 95 | 4.7% |
pszemraj/long-t5-tglobal-base-scientific_lay_summarisation-elife-norm-r1 | 76 | 3.7% |
gpt-3.5-turbo | 76 | 3.7% |
Other values (33) | 993 |
Most occurring characters
Value | Count | Frequency (%) |
- | 8336 | 10.7% |
a | 6216 | 8.0% |
s | 5603 | 7.2% |
e | 5183 | 6.7% |
m | 4524 | 5.8% |
r | 4336 | 5.6% |
l | 4290 | 5.5% |
o | 4271 | 5.5% |
t | 3416 | 4.4% |
g | 2942 | 3.8% |
Other values (44) | 28614 |
Most occurring categories
Value | Count | Frequency (%) |
Lowercase Letter | 61007 | |
Dash Punctuation | 8336 | 10.7% |
Decimal Number | 5427 | 7.0% |
Other Punctuation | 1848 | 2.4% |
Uppercase Letter | 895 | 1.2% |
Connector Punctuation | 218 | 0.3% |
Most frequent character per category
Lowercase Letter
Value | Count | Frequency (%) |
a | 6216 | 10.2% |
s | 5603 | 9.2% |
e | 5183 | 8.5% |
m | 4524 | 7.4% |
r | 4336 | 7.1% |
l | 4290 | 7.0% |
o | 4271 | 7.0% |
t | 3416 | 5.6% |
g | 2942 | 4.8% |
b | 2870 | 4.7% |
Other values (15) | 17356 |
Uppercase Letter
Value | Count | Frequency (%) |
A | 128 | |
I | 121 | |
B | 95 | |
W | 95 | |
J | 95 | |
P | 95 | |
M | 57 | |
E | 38 | 4.2% |
K | 38 | 4.2% |
T | 19 | 2.1% |
Other values (6) | 114 |
Decimal Number
Value | Count | Frequency (%) |
5 | 1088 | |
1 | 1016 | |
4 | 845 | |
3 | 636 | |
6 | 522 | |
8 | 484 | |
2 | 456 | |
0 | 323 | 6.0% |
9 | 57 | 1.1% |
Other Punctuation
Value | Count | Frequency (%) |
/ | 1772 | |
. | 76 | 4.1% |
Dash Punctuation
Value | Count | Frequency (%) |
- | 8336 |
Connector Punctuation
Value | Count | Frequency (%) |
_ | 218 |
Most occurring scripts
Value | Count | Frequency (%) |
Latin | 61902 | |
Common | 15829 | 20.4% |
Most frequent character per script
Latin
Value | Count | Frequency (%) |
a | 6216 | 10.0% |
s | 5603 | 9.1% |
e | 5183 | 8.4% |
m | 4524 | 7.3% |
r | 4336 | 7.0% |
l | 4290 | 6.9% |
o | 4271 | 6.9% |
t | 3416 | 5.5% |
g | 2942 | 4.8% |
b | 2870 | 4.6% |
Other values (31) | 18251 |
Common
Value | Count | Frequency (%) |
- | 8336 | |
/ | 1772 | 11.2% |
5 | 1088 | 6.9% |
1 | 1016 | 6.4% |
4 | 845 | 5.3% |
3 | 636 | 4.0% |
6 | 522 | 3.3% |
8 | 484 | 3.1% |
2 | 456 | 2.9% |
0 | 323 | 2.0% |
Other values (3) | 351 | 2.2% |
Most occurring blocks
Value | Count | Frequency (%) |
ASCII | 77731 |
Most frequent character per block
ASCII
Value | Count | Frequency (%) |
- | 8336 | 10.7% |
a | 6216 | 8.0% |
s | 5603 | 7.2% |
e | 5183 | 6.7% |
m | 4524 | 5.8% |
r | 4336 | 5.6% |
l | 4290 | 5.5% |
o | 4271 | 5.5% |
t | 3416 | 4.4% |
g | 2942 | 3.8% |
Other values (44) | 28614 |
date
Categorical
HIGH CARDINALITY
MISSING
Distinct | 96 |
---|---|
Distinct (%) | 5.3% |
Missing | 243 |
Missing (%) | 11.9% |
Memory size | 134.0 KiB |
Feb-17-2023 | 47 |
---|---|
Feb-21-2023 | 34 |
20230318_061812 | 19 |
20230318_055638 | 19 |
20230409_020526 | 19 |
Other values (91) |
Length
Max length | 17 |
---|---|
Median length | 15 |
Mean length | 14.840556 |
Min length | 8 |
Characters and Unicode
Total characters | 26713 |
---|---|
Distinct characters | 20 |
Distinct categories | 7 |
Distinct scripts | 2 |
Distinct blocks | 1 |
Unique
Unique | 0 |
---|---|
Unique (%) | 0.0% |
Sample
1st row | 20230524_080731 |
---|---|
2nd row | 20230524_080731 |
3rd row | 20230524_080731 |
4th row | 20230524_080731 |
5th row | 20230524_080731 |
Common Values
Value | Count | Frequency (%) |
Feb-17-2023 | 47 | 2.3% |
Feb-21-2023 | 34 | 1.7% |
20230318_061812 | 19 | 0.9% |
20230318_055638 | 19 | 0.9% |
20230409_020526 | 19 | 0.9% |
20230408_160437 | 19 | 0.9% |
20230220_214802 | 19 | 0.9% |
20230316_150446 | 19 | 0.9% |
20230316_134625 | 19 | 0.9% |
20230220_205040 | 19 | 0.9% |
Other values (86) | 1567 | 76.7% |
(Missing) | 243 | 11.9% |
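The date column is typed as Categorical because it mixes at least two string layouts: `Feb-17-2023` and `20230318_061812`. A minimal sketch of normalizing them before re-profiling follows. It covers only the two layouts visible in the table above; any other layout in the column is returned as `None` rather than guessed. The helper name `parse_run_date` is hypothetical, and `%b` assumes English month abbreviations under the active locale.

```python
from datetime import datetime

# Layouts visible in the report's common values. Other layouts present in
# the column are deliberately not guessed at here.
KNOWN_FORMATS = ("%Y%m%d_%H%M%S", "%b-%d-%Y")


def parse_run_date(raw):
    """Return a datetime for a known layout, or None if nothing matches."""
    if raw is None:
        return None
    text = raw.strip()
    for fmt in KNOWN_FORMATS:
        try:
            return datetime.strptime(text, fmt)
        except ValueError:
            continue  # try the next layout
    return None


print(parse_run_date("20230318_061812"))
print(parse_run_date("Feb-17-2023"))
print(parse_run_date("unrecognized"))
```

Once the values are real datetimes, a profiler can treat the column as a date type instead of a high-cardinality category.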
Length
Value | Count | Frequency (%) |
2023-feb-27 | 81 | 4.3% |
feb-17-2023 | 47 | 2.5% |
feb-21-2023 | 34 | 1.8% |
20230315_234849 | 19 | 1.0% |
20230318_040524 | 19 | 1.0% |
20230318_022628 | 19 | 1.0% |
20230318_022921 | 19 | 1.0% |
20230505_151208 | 19 | 1.0% |
20230524_061826 | 19 | 1.0% |
20230316_015916 | 19 | 1.0% |
Other values (87) | 1586 |
Most occurring characters
Value | Count | Frequency (%) |
0 | 5978 | |
2 | 5887 | |
3 | 3738 | |
1 | 2180 | 8.2% |
5 | 1862 | 7.0% |
_ | 1612 | 6.0% |
4 | 1588 | 5.9% |
6 | 930 | 3.5% |
8 | 824 | 3.1% |
7 | 733 | 2.7% |
Other values (10) | 1381 | 5.2% |
Most occurring categories
Value | Count | Frequency (%) |
Decimal Number | 24034 | |
Connector Punctuation | 1612 | 6.0% |
Dash Punctuation | 362 | 1.4% |
Lowercase Letter | 362 | 1.4% |
Uppercase Letter | 181 | 0.7% |
Space Separator | 81 | 0.3% |
Other Punctuation | 81 | 0.3% |
Most frequent character per category
Decimal Number
Value | Count | Frequency (%) |
0 | 5978 | |
2 | 5887 | |
3 | 3738 | |
1 | 2180 | 9.1% |
5 | 1862 | 7.7% |
4 | 1588 | 6.6% |
6 | 930 | 3.9% |
8 | 824 | 3.4% |
7 | 733 | 3.0% |
9 | 314 | 1.3% |
Lowercase Letter
Value | Count | Frequency (%) |
e | 162 | |
b | 162 | |
a | 19 | 5.2% |
r | 19 | 5.2% |
Uppercase Letter
Value | Count | Frequency (%) |
F | 162 | |
M | 19 | 10.5% |
Connector Punctuation
Value | Count | Frequency (%) |
_ | 1612 |
Dash Punctuation
Value | Count | Frequency (%) |
- | 362 |
Space Separator
Value | Count | Frequency (%) |
(space) | 81 |
Other Punctuation
Value | Count | Frequency (%) |
: | 81 |
Most occurring scripts
Value | Count | Frequency (%) |
Common | 26170 | |
Latin | 543 | 2.0% |
Most frequent character per script
Common
Value | Count | Frequency (%) |
0 | 5978 | |
2 | 5887 | |
3 | 3738 | |
1 | 2180 | 8.3% |
5 | 1862 | 7.1% |
_ | 1612 | 6.2% |
4 | 1588 | 6.1% |
6 | 930 | 3.6% |
8 | 824 | 3.1% |
7 | 733 | 2.8% |
Other values (4) | 838 | 3.2% |
Latin
Value | Count | Frequency (%) |
F | 162 | |
e | 162 | |
b | 162 | |
M | 19 | 3.5% |
a | 19 | 3.5% |
r | 19 | 3.5% |
Most occurring blocks
Value | Count | Frequency (%) |
ASCII | 26713 |
Most frequent character per block
ASCII
Value | Count | Frequency (%) |
0 | 5978 | |
2 | 5887 | |
3 | 3738 | |
1 | 2180 | 8.2% |
5 | 1862 | 7.0% |
_ | 1612 | 6.0% |
4 | 1588 | 5.9% |
6 | 930 | 3.5% |
8 | 824 | 3.1% |
7 | 733 | 2.7% |
Other values (10) | 1381 | 5.2% |
length
Categorical
Distinct | 2 |
---|---|
Distinct (%) | 2.5% |
Missing | 1962 |
Missing (%) | 96.0% |
Memory size | 66.3 KiB |
long | |
---|---|
medium |
Length
Max length | 6 |
---|---|
Median length | 4 |
Mean length | 4.3209877 |
Min length | 4 |
Characters and Unicode
Total characters | 350 |
---|---|
Distinct characters | 9 |
Distinct categories | 1 |
Distinct scripts | 1 |
Distinct blocks | 1 |
Unique
Unique | 0 |
---|---|
Unique (%) | 0.0% |
Sample
1st row | medium |
---|---|
2nd row | medium |
3rd row | medium |
4th row | medium |
5th row | medium |
Common Values
Value | Count | Frequency (%) |
long | 68 | 3.3% |
medium | 13 | 0.6% |
(Missing) | 1962 | 96.0% |
Length
Common Values (Plot)
Value | Count | Frequency (%) |
long | 68 | |
medium | 13 | 16.0% |
Most occurring characters
Value | Count | Frequency (%) |
l | 68 | |
o | 68 | |
n | 68 | |
g | 68 | |
m | 26 | 7.4% |
e | 13 | 3.7% |
d | 13 | 3.7% |
i | 13 | 3.7% |
u | 13 | 3.7% |
Most occurring categories
Value | Count | Frequency (%) |
Lowercase Letter | 350 |
Most frequent character per category
Lowercase Letter
Value | Count | Frequency (%) |
l | 68 | |
o | 68 | |
n | 68 | |
g | 68 | |
m | 26 | 7.4% |
e | 13 | 3.7% |
d | 13 | 3.7% |
i | 13 | 3.7% |
u | 13 | 3.7% |
Most occurring scripts
Value | Count | Frequency (%) |
Latin | 350 |
Most frequent character per script
Latin
Value | Count | Frequency (%) |
l | 68 | |
o | 68 | |
n | 68 | |
g | 68 | |
m | 26 | 7.4% |
e | 13 | 3.7% |
d | 13 | 3.7% |
i | 13 | 3.7% |
u | 13 | 3.7% |
Most occurring blocks
Value | Count | Frequency (%) |
ASCII | 350 |
Most frequent character per block
ASCII
Value | Count | Frequency (%) |
l | 68 | |
o | 68 | |
n | 68 | |
g | 68 | |
m | 26 | 7.4% |
e | 13 | 3.7% |
d | 13 | 3.7% |
i | 13 | 3.7% |
u | 13 | 3.7% |
format
Categorical
CONSTANT
MISSING
Distinct | 1 |
---|---|
Distinct (%) | 1.2% |
Missing | 1962 |
Missing (%) | 96.0% |
Memory size | 66.7 KiB |
paragraph |
---|
Length
Max length | 9 |
---|---|
Median length | 9 |
Mean length | 9 |
Min length | 9 |
Characters and Unicode
Total characters | 729 |
---|---|
Distinct characters | 5 |
Distinct categories | 1 |
Distinct scripts | 1 |
Distinct blocks | 1 |
Unique
Unique | 0 |
---|---|
Unique (%) | 0.0% |
Sample
1st row | paragraph |
---|---|
2nd row | paragraph |
3rd row | paragraph |
4th row | paragraph |
5th row | paragraph |
Common Values
Value | Count | Frequency (%) |
paragraph | 81 | 4.0% |
(Missing) | 1962 | 96.0% |
Length
Common Values (Plot)
Value | Count | Frequency (%) |
paragraph | 81 |
Most occurring characters
Value | Count | Frequency (%) |
a | 243 | |
p | 162 | |
r | 162 | |
g | 81 | 11.1% |
h | 81 | 11.1% |
Most occurring categories
Value | Count | Frequency (%) |
Lowercase Letter | 729 |
Most frequent character per category
Lowercase Letter
Value | Count | Frequency (%) |
a | 243 | |
p | 162 | |
r | 162 | |
g | 81 | 11.1% |
h | 81 | 11.1% |
Most occurring scripts
Value | Count | Frequency (%) |
Latin | 729 |
Most frequent character per script
Latin
Value | Count | Frequency (%) |
a | 243 | |
p | 162 | |
r | 162 | |
g | 81 | 11.1% |
h | 81 | 11.1% |
Most occurring blocks
Value | Count | Frequency (%) |
ASCII | 729 |
Most frequent character per block
ASCII
Value | Count | Frequency (%) |
a | 243 | |
p | 162 | |
r | 162 | |
g | 81 | 11.1% |
h | 81 | 11.1% |
extractiveness
Categorical
Distinct | 2 |
---|---|
Distinct (%) | 2.5% |
Missing | 1962 |
Missing (%) | 96.0% |
Memory size | 66.3 KiB |
low | |
---|---|
medium |
Length
Max length | 6 |
---|---|
Median length | 3 |
Mean length | 4.2592593 |
Min length | 3 |
Characters and Unicode
Total characters | 345 |
---|---|
Distinct characters | 8 |
Distinct categories | 1 |
Distinct scripts | 1 |
Distinct blocks | 1 |
Unique
Unique | 0 |
---|---|
Unique (%) | 0.0% |
Sample
1st row | low |
---|---|
2nd row | low |
3rd row | low |
4th row | low |
5th row | low |
Common Values
Value | Count | Frequency (%) |
low | 47 | 2.3% |
medium | 34 | 1.7% |
(Missing) | 1962 | 96.0% |
Length
Common Values (Plot)
Value | Count | Frequency (%) |
low | 47 | |
medium | 34 |
Most occurring characters
Value | Count | Frequency (%) |
m | 68 | |
l | 47 | |
o | 47 | |
w | 47 | |
e | 34 | |
d | 34 | |
i | 34 | |
u | 34 |
Most occurring categories
Value | Count | Frequency (%) |
Lowercase Letter | 345 |
Most frequent character per category
Lowercase Letter
Value | Count | Frequency (%) |
m | 68 | |
l | 47 | |
o | 47 | |
w | 47 | |
e | 34 | |
d | 34 | |
i | 34 | |
u | 34 |
Most occurring scripts
Value | Count | Frequency (%) |
Latin | 345 |
Most frequent character per script
Latin
Value | Count | Frequency (%) |
m | 68 | |
l | 47 | |
o | 47 | |
w | 47 | |
e | 34 | |
d | 34 | |
i | 34 | |
u | 34 |
Most occurring blocks
Value | Count | Frequency (%) |
ASCII | 345 |
Most frequent character per block
ASCII
Value | Count | Frequency (%) |
m | 68 | |
l | 47 | |
o | 47 | |
w | 47 | |
e | 34 | |
d | 34 | |
i | 34 | |
u | 34 |
temperature
Categorical
Distinct | 2 |
---|---|
Distinct (%) | 2.5% |
Missing | 1962 |
Missing (%) | 96.0% |
Memory size | 81.5 KiB |
0.5 | |
---|---|
1.0 |
Length
Max length | 3 |
---|---|
Median length | 3 |
Mean length | 3 |
Min length | 3 |
Characters and Unicode
Total characters | 243 |
---|---|
Distinct characters | 4 |
Distinct categories | 2 |
Distinct scripts | 1 |
Distinct blocks | 1 |
Unique
Unique | 0 |
---|---|
Unique (%) | 0.0% |
Sample
1st row | 0.5 |
---|---|
2nd row | 0.5 |
3rd row | 0.5 |
4th row | 0.5 |
5th row | 0.5 |
Common Values
Value | Count | Frequency (%) |
0.5 | 66 | 3.2% |
1.0 | 15 | 0.7% |
(Missing) | 1962 | 96.0% |
Length
Common Values (Plot)
Value | Count | Frequency (%) |
0.5 | 66 | |
1.0 | 15 | 18.5% |
Most occurring characters
Value | Count | Frequency (%) |
0 | 81 | |
. | 81 | |
5 | 66 | |
1 | 15 | 6.2% |
Most occurring categories
Value | Count | Frequency (%) |
Decimal Number | 162 | |
Other Punctuation | 81 |
Most frequent character per category
Decimal Number
Value | Count | Frequency (%) |
0 | 81 | |
5 | 66 | |
1 | 15 | 9.3% |
Other Punctuation
Value | Count | Frequency (%) |
. | 81 |
Most occurring scripts
Value | Count | Frequency (%) |
Common | 243 |
Most frequent character per script
Common
Value | Count | Frequency (%) |
0 | 81 | |
. | 81 | |
5 | 66 | |
1 | 15 | 6.2% |
Most occurring blocks
Value | Count | Frequency (%) |
ASCII | 243 |
Most frequent character per block
ASCII
Value | Count | Frequency (%) |
0 | 81 | |
. | 81 | |
5 | 66 | |
1 | 15 | 6.2% |
token_batch_length
Real number (ℝ)
Distinct | 6 |
---|---|
Distinct (%) | 1.7% |
Missing | 1700 |
Missing (%) | 83.2% |
Infinite | 0 |
Infinite (%) | 0.0% |
Mean | 13886.321 |
Minimum | 1024 |
---|---|
Maximum | 32768 |
Zeros | 0 |
Zeros (%) | 0.0% |
Negative | 0 |
Negative (%) | 0.0% |
Memory size | 16.1 KiB |
Quantile statistics
Minimum | 1024 |
---|---|
5-th percentile | 1024 |
Q1 | 3584 |
median | 8192 |
Q3 | 32768 |
95-th percentile | 32768 |
Maximum | 32768 |
Range | 31744 |
Interquartile range (IQR) | 29184 |
Descriptive statistics
Standard deviation | 11943.674 |
---|---|
Coefficient of variation (CV) | 0.86010355 |
Kurtosis | -1.1042253 |
Mean | 13886.321 |
Median Absolute Deviation (MAD) | 4608 |
Skewness | 0.79092605 |
Sum | 4763008 |
Variance | 1.4265134 × 10^8 |
Monotonicity | Not monotonic |
Common Values
Value | Count | Frequency (%) |
32768 | 91 | 4.5% |
7200 | 76 | 3.7% |
3584 | 76 | 3.7% |
8192 | 47 | 2.3% |
16384 | 34 | 1.7% |
1024 | 19 | 0.9% |
(Missing) | 1700 | 83.2% |
Minimum 10 values
Value | Count | Frequency (%) |
1024 | 19 | 0.9% |
3584 | 76 | 3.7% |
7200 | 76 | 3.7% |
8192 | 47 | 2.3% |
16384 | 34 | 1.7% |
32768 | 91 | 4.5% |
Maximum 10 values
Value | Count | Frequency (%) |
32768 | 91 | 4.5% |
16384 | 34 | 1.7% |
8192 | 47 | 2.3% |
7200 | 76 | 3.7% |
3584 | 76 | 3.7% |
1024 | 19 | 0.9% |
penalty_alpha
Categorical
CONSTANT
MISSING
Distinct | 1 |
---|---|
Distinct (%) | 1.2% |
Missing | 1962 |
Missing (%) | 96.0% |
Memory size | 81.5 KiB |
0.6 |
---|
Length
Max length | 3 |
---|---|
Median length | 3 |
Mean length | 3 |
Min length | 3 |
Characters and Unicode
Total characters | 243 |
---|---|
Distinct characters | 3 |
Distinct categories | 2 |
Distinct scripts | 1 |
Distinct blocks | 1 |
Unique
Unique | 0 |
---|---|
Unique (%) | 0.0% |
Sample
1st row | 0.6 |
---|---|
2nd row | 0.6 |
3rd row | 0.6 |
4th row | 0.6 |
5th row | 0.6 |
Common Values
Value | Count | Frequency (%) |
0.6 | 81 | 4.0% |
(Missing) | 1962 | 96.0% |
Length
Common Values (Plot)
Value | Count | Frequency (%) |
0.6 | 81 |
Most occurring characters
Value | Count | Frequency (%) |
0 | 81 | |
. | 81 | |
6 | 81 |
Most occurring categories
Value | Count | Frequency (%) |
Decimal Number | 162 | |
Other Punctuation | 81 |
Most frequent character per category
Decimal Number
Value | Count | Frequency (%) |
0 | 81 | |
6 | 81 |
Other Punctuation
Value | Count | Frequency (%) |
. | 81 |
Most occurring scripts
Value | Count | Frequency (%) |
Common | 243 |
Most frequent character per script
Common
Value | Count | Frequency (%) |
0 | 81 | |
. | 81 | |
6 | 81 |
Most occurring blocks
Value | Count | Frequency (%) |
ASCII | 243 |
Most frequent character per block
ASCII
Value | Count | Frequency (%) |
0 | 81 | |
. | 81 | |
6 | 81 |
top_k
Categorical
Distinct | 2 |
---|---|
Distinct (%) | 2.5% |
Missing | 1962 |
Missing (%) | 96.0% |
Memory size | 81.5 KiB |
4.0 | |
---|---|
8.0 |
Length
Max length | 3 |
---|---|
Median length | 3 |
Mean length | 3 |
Min length | 3 |
Characters and Unicode
Total characters | 243 |
---|---|
Distinct characters | 4 |
Distinct categories | 2 |
Distinct scripts | 1 |
Distinct blocks | 1 |
Unique
Unique | 0 |
---|---|
Unique (%) | 0.0% |
Sample
1st row | 4.0 |
---|---|
2nd row | 4.0 |
3rd row | 4.0 |
4th row | 4.0 |
5th row | 4.0 |
Common Values
Value | Count | Frequency (%) |
4.0 | 70 | 3.4% |
8.0 | 11 | 0.5% |
(Missing) | 1962 | 96.0% |
Length
Common Values (Plot)
Value | Count | Frequency (%) |
4.0 | 70 | |
8.0 | 11 | 13.6% |
Most occurring characters
Value | Count | Frequency (%) |
. | 81 | |
0 | 81 | |
4 | 70 | |
8 | 11 | 4.5% |
Most occurring categories
Value | Count | Frequency (%) |
Decimal Number | 162 | |
Other Punctuation | 81 |
Most frequent character per category
Decimal Number
Value | Count | Frequency (%) |
0 | 81 | |
4 | 70 | |
8 | 11 | 6.8% |
Other Punctuation
Value | Count | Frequency (%) |
. | 81 |
Most occurring scripts
Value | Count | Frequency (%) |
Common | 243 |
Most frequent character per script
Common
Value | Count | Frequency (%) |
. | 81 | |
0 | 81 | |
4 | 70 | |
8 | 11 | 4.5% |
Most occurring blocks
Value | Count | Frequency (%) |
ASCII | 243 |
Most frequent character per block
ASCII
Value | Count | Frequency (%) |
. | 81 | |
0 | 81 | |
4 | 70 | |
8 | 11 | 4.5% |
batch_stride
Categorical
Distinct | 2 |
---|---|
Distinct (%) | 0.8% |
Missing | 1791 |
Missing (%) | 87.7% |
Memory size | 84.9 KiB |
0.0 | |
---|---|
24.0 |
Length
Max length | 4 |
---|---|
Median length | 3 |
Mean length | 3.3968254 |
Min length | 3 |
Characters and Unicode
Total characters | 856 |
---|---|
Distinct characters | 4 |
Distinct categories | 2 |
Distinct scripts | 1 |
Distinct blocks | 1 |
Unique
Unique | 0 |
---|---|
Unique (%) | 0.0% |
Sample
1st row | 24.0 |
---|---|
2nd row | 24.0 |
3rd row | 24.0 |
4th row | 24.0 |
5th row | 24.0 |
Common Values
Value | Count | Frequency (%) |
0.0 | 152 | 7.4% |
24.0 | 100 | 4.9% |
(Missing) | 1791 | 87.7% |
Length
Common Values (Plot)
Value | Count | Frequency (%) |
0.0 | 152 | |
24.0 | 100 |
Most occurring characters
Value | Count | Frequency (%) |
0 | 404 | |
. | 252 | |
2 | 100 | 11.7% |
4 | 100 | 11.7% |
Most occurring categories
Value | Count | Frequency (%) |
Decimal Number | 604 | |
Other Punctuation | 252 |
Most frequent character per category
Decimal Number
Value | Count | Frequency (%) |
0 | 404 | |
2 | 100 | 16.6% |
4 | 100 | 16.6% |
Other Punctuation
Value | Count | Frequency (%) |
. | 252 |
Most occurring scripts
Value | Count | Frequency (%) |
Common | 856 |
Most frequent character per script
Common
Value | Count | Frequency (%) |
0 | 404 | |
. | 252 | |
2 | 100 | 11.7% |
4 | 100 | 11.7% |
Most occurring blocks
Value | Count | Frequency (%) |
ASCII | 856 |
Most frequent character per block
ASCII
Value | Count | Frequency (%) |
0 | 404 | |
. | 252 | |
2 | 100 | 11.7% |
4 | 100 | 11.7% |
max_len_ratio
Categorical
Distinct | 3 |
---|---|
Distinct (%) | 3.0% |
Missing | 1943 |
Missing (%) | 95.1% |
Memory size | 81.9 KiB |
5.0 | |
---|---|
4.25 | |
4.0 |
Length
Max length | 4 |
---|---|
Median length | 3 |
Mean length | 3.3 |
Min length | 3 |
Characters and Unicode
Total characters | 330 |
---|---|
Distinct characters | 5 |
Distinct categories | 2 |
Distinct scripts | 1 |
Distinct blocks | 1 |
Unique
Unique | 0 |
---|---|
Unique (%) | 0.0% |
Sample
1st row | 5.0 |
---|---|
2nd row | 5.0 |
3rd row | 5.0 |
4th row | 5.0 |
5th row | 5.0 |
Common Values
Value | Count | Frequency (%) |
5.0 | 51 | 2.5% |
4.25 | 30 | 1.5% |
4.0 | 19 | 0.9% |
(Missing) | 1943 | 95.1% |
Length
Common Values (Plot)
Value | Count | Frequency (%) |
5.0 | 51 | |
4.25 | 30 | |
4.0 | 19 | 19.0% |
Most occurring characters
Value | Count | Frequency (%) |
. | 100 | |
5 | 81 | |
0 | 70 | |
4 | 49 | |
2 | 30 | 9.1% |
Most occurring categories
Value | Count | Frequency (%) |
Decimal Number | 230 | |
Other Punctuation | 100 |
Most frequent character per category
Decimal Number
Value | Count | Frequency (%) |
5 | 81 | |
0 | 70 | |
4 | 49 | |
2 | 30 | 13.0% |
Other Punctuation
Value | Count | Frequency (%) |
. | 100 |
Most occurring scripts
Value | Count | Frequency (%) |
Common | 330 |
Most frequent character per script
Common
Value | Count | Frequency (%) |
. | 100 | |
5 | 81 | |
0 | 70 | |
4 | 49 | |
2 | 30 | 9.1% |
Most occurring blocks
Value | Count | Frequency (%) |
ASCII | 330 |
Most frequent character per block
ASCII
Value | Count | Frequency (%) |
. | 100 | |
5 | 81 | |
0 | 70 | |
4 | 49 | |
2 | 30 | 9.1% |
directory-topic-tag
Categorical
Distinct | 6 |
---|---|
Distinct (%) | 6.0% |
Missing | 1943 |
Missing (%) | 95.1% |
Memory size | 70.1 KiB |
gauntlet-csearch-tglobal-XL-public | |
---|---|
gaunlet-flan-t5-large-xsum-r1 | |
gauntlet-csearch-16384-topk4-longt5-base | |
gauntlet-csearch-8192-topk4-longt5-base | |
gauntlet-csearch-16384-len-penalty-tglobal-XL-public |
Length
Max length | 52 |
---|---|
Median length | 39 |
Mean length | 38.04 |
Min length | 29 |
Characters and Unicode
Total characters | 3804 |
---|---|
Distinct characters | 31 |
Distinct categories | 4 |
Distinct scripts | 2 |
Distinct blocks | 1 |
Unique
Unique | 1 |
---|---|
Unique (%) | 1.0% |
Sample
1st row | gauntlet-csearch-16384-topk4-longt5-base |
---|---|
2nd row | gauntlet-csearch-16384-topk4-longt5-base |
3rd row | gauntlet-csearch-16384-topk4-longt5-base |
4th row | gauntlet-csearch-16384-topk4-longt5-base |
5th row | gauntlet-csearch-16384-topk4-longt5-base |
Common Values
Value | Count | Frequency (%) |
gauntlet-csearch-tglobal-XL-public | 29 | 1.4% |
gaunlet-flan-t5-large-xsum-r1 | 19 | 0.9% |
gauntlet-csearch-16384-topk4-longt5-base | 17 | 0.8% |
gauntlet-csearch-8192-topk4-longt5-base | 17 | 0.8% |
gauntlet-csearch-16384-len-penalty-tglobal-XL-public | 17 | 0.8% |
gauntlet-csearch-16384-tglobal-XL-public | 1 | < 0.1% |
(Missing) | 1943 | 95.1% |
Length
Common Values (Plot)
Value | Count | Frequency (%) |
gauntlet-csearch-tglobal-xl-public | 29 | |
gaunlet-flan-t5-large-xsum-r1 | 19 | |
gauntlet-csearch-16384-topk4-longt5-base | 17 | |
gauntlet-csearch-8192-topk4-longt5-base | 17 | |
gauntlet-csearch-16384-len-penalty-tglobal-xl-public | 17 | |
gauntlet-csearch-16384-tglobal-xl-public | 1 | 1.0% |
Most occurring characters
Value | Count | Frequency (%) |
- | 505 | |
l | 347 | 9.1% |
t | 332 | 8.7% |
a | 317 | 8.3% |
e | 268 | 7.0% |
c | 209 | 5.5% |
g | 200 | 5.3% |
n | 187 | 4.9% |
u | 166 | 4.4% |
s | 134 | 3.5% |
Other values (21) | 1139 |
Most occurring categories
Value | Count | Frequency (%) |
Lowercase Letter | 2856 | |
Dash Punctuation | 505 | 13.3% |
Decimal Number | 349 | 9.2% |
Uppercase Letter | 94 | 2.5% |
Most frequent character per category
Lowercase Letter
Value | Count | Frequency (%) |
l | 347 | |
t | 332 | |
a | 317 | |
e | 268 | |
c | 209 | 7.3% |
g | 200 | 7.0% |
n | 187 | 6.5% |
u | 166 | 5.8% |
s | 134 | 4.7% |
b | 128 | 4.5% |
Other values (10) | 568 |
Decimal Number
Value | Count | Frequency (%) |
1 | 71 | |
4 | 69 | |
5 | 53 | |
8 | 52 | |
6 | 35 | |
3 | 35 | |
9 | 17 | 4.9% |
2 | 17 | 4.9% |
Uppercase Letter
Value | Count | Frequency (%) |
L | 47 | |
X | 47 |
Dash Punctuation
Value | Count | Frequency (%) |
- | 505 |
Most occurring scripts
Value | Count | Frequency (%) |
Latin | 2950 | |
Common | 854 | 22.5% |
Most frequent character per script
Latin
Value | Count | Frequency (%) |
l | 347 | |
t | 332 | |
a | 317 | |
e | 268 | |
c | 209 | 7.1% |
g | 200 | 6.8% |
n | 187 | 6.3% |
u | 166 | 5.6% |
s | 134 | 4.5% |
b | 128 | 4.3% |
Other values (12) | 662 |
Common
Value | Count | Frequency (%) |
- | 505 | |
1 | 71 | 8.3% |
4 | 69 | 8.1% |
5 | 53 | 6.2% |
8 | 52 | 6.1% |
6 | 35 | 4.1% |
3 | 35 | 4.1% |
9 | 17 | 2.0% |
2 | 17 | 2.0% |
Most occurring blocks
Value | Count | Frequency (%) |
ASCII | 3804 |
Most frequent character per block
ASCII
Value | Count | Frequency (%) |
- | 505 | |
l | 347 | 9.1% |
t | 332 | 8.7% |
a | 317 | 8.3% |
e | 268 | 7.0% |
c | 209 | 5.5% |
g | 200 | 5.3% |
n | 187 | 4.9% |
u | 166 | 4.4% |
s | 134 | 3.5% |
Other values (21) | 1139 |
runtime
Categorical
CONSTANT
MISSING
Distinct | 1 |
---|---|
Distinct (%) | 1.3% |
Missing | 1967 |
Missing (%) | 96.3% |
Memory size | 66.2 KiB |
45:30 |
---|
Length
Max length | 5 |
---|---|
Median length | 5 |
Mean length | 5 |
Min length | 5 |
Characters and Unicode
Total characters | 380 |
---|---|
Distinct characters | 5 |
Distinct categories | 2 |
Distinct scripts | 1 |
Distinct blocks | 1 |
Unique
Unique | 0 |
---|---|
Unique (%) | 0.0% |
Sample
1st row | 45:30 |
---|---|
2nd row | 45:30 |
3rd row | 45:30 |
4th row | 45:30 |
5th row | 45:30 |
Common Values
Value | Count | Frequency (%) |
45:30 | 76 | 3.7% |
(Missing) | 1967 | 96.3% |
Length
Common Values (Plot)
Value | Count | Frequency (%) |
45:30 | 76 |
Most occurring characters
Value | Count | Frequency (%) |
4 | 76 | |
5 | 76 | |
: | 76 | |
3 | 76 | |
0 | 76 |
Most occurring categories
Value | Count | Frequency (%) |
Decimal Number | 304 | |
Other Punctuation | 76 | 20.0% |
Most frequent character per category
Decimal Number
Value | Count | Frequency (%) |
4 | 76 | |
5 | 76 | |
3 | 76 | |
0 | 76 |
Other Punctuation
Value | Count | Frequency (%) |
: | 76 |
Most occurring scripts
Value | Count | Frequency (%) |
Common | 380 |
Most frequent character per script
Common
Value | Count | Frequency (%) |
4 | 76 | |
5 | 76 | |
: | 76 | |
3 | 76 | |
0 | 76 |
Most occurring blocks
Value | Count | Frequency (%) |
ASCII | 380 |
Most frequent character per block
ASCII
Value | Count | Frequency (%) |
4 | 76 | |
5 | 76 | |
: | 76 | |
3 | 76 | |
0 | 76 |
source_doc_filename
Categorical
Distinct | 19 |
---|---|
Distinct (%) | 0.9% |
Missing | 0 |
Missing (%) | 0.0% |
Memory size | 208.3 KiB |
Emie_dissertation_cleansed.txt | 113 |
---|---|
OCR_PAPER_Hong et al. - 2022 - CogVideo Large-scale Pretraining for Text-to-Video Generation via Transformers-annotated_.txt | 111 |
OCR_ML4HLecture02image_.txt | 110 |
OCR_PAPER_Kandpal, Nieto, Jin - 2022 - Music Enhancement via Image Translation and Vocoding-annotated_.txt | 110 |
script_findingnemo.txt | 109 |
Other values (14) |
Length
Max length | 124 |
---|---|
Median length | 44 |
Mean length | 47.31816 |
Min length | 22 |
Characters and Unicode
Total characters | 96671 |
---|---|
Distinct characters | 57 |
Distinct categories | 7 |
Distinct scripts | 2 |
Distinct blocks | 1 |
Unique
Unique | 0 |
---|---|
Unique (%) | 0.0% |
Sample
1st row | ASR-whisper-rpunctuated_Noam Chomsky, Fundam_1669853561_0_part1.txt |
---|---|
2nd row | ASR-whisper-rpunctuated_Noam Chomsky, Fundam_1669853631_0_part2.txt |
3rd row | ASRnlp_law_lecture_week_1_v_2_c_transcription_1.txt |
4th row | ASRnlp_law_lecture_week_2_v_2_c_transcription_2.txt |
5th row | ASRnlp_law_lecture_week_3_part_1_v_2_c_transcription_3.txt |
Common Values
Value | Count | Frequency (%) |
Emie_dissertation_cleansed.txt | 113 | 5.5% |
OCR_PAPER_Hong et al. - 2022 - CogVideo Large-scale Pretraining for Text-to-Video Generation via Transformers-annotated_.txt | 111 | 5.4% |
OCR_ML4HLecture02image_.txt | 110 | 5.4% |
OCR_PAPER_Kandpal, Nieto, Jin - 2022 - Music Enhancement via Image Translation and Vocoding-annotated_.txt | 110 | 5.4% |
script_findingnemo.txt | 109 | 5.3% |
OCR_ML4HLecture04RepresentationLearning.pptx_.txt | 109 | 5.3% |
OCR_ML4HLecture05-NLP.pptx_.txt | 109 | 5.3% |
script_frozendisney.txt | 109 | 5.3% |
The Most Dangerous Game--Richard Connell.txt | 109 | 5.3% |
ASRnlp_law_lecture_week_3_part_1_v_2_c_transcription_3.txt | 108 | 5.3% |
Other values (9) | 946 | 46.3% |
Length
Value | Count | Frequency (%) |
(empty) | 442 | 7.4% |
2022 | 221 | 3.7% |
via | 221 | 3.7% |
chomsky | 212 | 3.6% |
asr-whisper-rpunctuated_noam | 212 | 3.6% |
emie_dissertation_cleansed.txt | 113 | 1.9% |
generation | 111 | 1.9% |
ocr_paper_hong | 111 | 1.9% |
transformers-annotated_.txt | 111 | 1.9% |
text-to-video | 111 | 1.9% |
Other values (38) | 4098 | 68.7% |
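The word table above lists lowercase tokens such as `chomsky` and `transformers-annotated_.txt`, plus one row with a blank value and a count of 442. The following is a minimal sketch of a tokenization that would produce rows like these. It assumes (the report does not say so) that each filename is lowercased, split on whitespace, and stripped of leading and trailing punctuation. `word_counts` is a hypothetical helper introduced for illustration, not the profiler's own function.

```python
import string
from collections import Counter


def word_counts(values):
    """Count lowercase, whitespace-separated tokens across all values.

    Assumption: punctuation is stripped only from the ends of each token,
    so a standalone "-" collapses to an empty string.
    """
    counts = Counter()
    for value in values:
        for token in value.lower().split():
            counts[token.strip(string.punctuation)] += 1
    return counts


# One filename taken from the Common Values table of this column.
names = [
    "OCR_PAPER_Hong et al. - 2022 - CogVideo Large-scale Pretraining "
    "for Text-to-Video Generation via Transformers-annotated_.txt"
]
counts = word_counts(names)
print(counts[""], counts["2022"], counts["via"])
```

Under this assumption, the two standalone `-` separators in each of the two `OCR_PAPER_...` filenames reduce to empty strings, which would account for the blank row: 2 × (111 + 110) = 442.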
Most occurring characters
Value | Count | Frequency (%) |
t | 9700 | 10.0% |
e | 6951 | 7.2% |
_ | 6664 | 6.9% |
n | 5853 | 6.1% |
a | 5710 | 5.9% |
r | 4755 | 4.9% |
i | 3922 | 4.1% |
(space) | 3920 | 4.1% |
o | 3477 | 3.6% |
s | 3433 | 3.6% |
Other values (47) | 42286 | 43.7% |
Most occurring categories
Value | Count | Frequency (%) |
Lowercase Letter | 65204 | 67.4% |
Uppercase Letter | 10327 | 10.7% |
Connector Punctuation | 6664 | 6.9% |
Decimal Number | 5585 | 5.8% |
Space Separator | 3920 | 4.1% |
Other Punctuation | 2909 | 3.0% |
Dash Punctuation | 2062 | 2.1% |
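The category labels in the table above are Unicode general categories. The following is a minimal sketch of how such counts can be reproduced with Python's standard `unicodedata` module. `category_counts` and the `CATEGORY_NAMES` mapping are illustrative names introduced here, not part of the report or of the profiling tool.

```python
import unicodedata
from collections import Counter

# Readable labels for the general-category codes that occur in this column.
CATEGORY_NAMES = {
    "Ll": "Lowercase Letter",
    "Lu": "Uppercase Letter",
    "Pc": "Connector Punctuation",
    "Nd": "Decimal Number",
    "Zs": "Space Separator",
    "Po": "Other Punctuation",
    "Pd": "Dash Punctuation",
}


def category_counts(values):
    """Count characters per Unicode general category across all values."""
    counts = Counter()
    for value in values:
        for ch in value:
            code = unicodedata.category(ch)
            # Fall back to the raw two-letter code for unlisted categories.
            counts[CATEGORY_NAMES.get(code, code)] += 1
    return counts


sample = category_counts(["script_findingnemo.txt"])
for name, n in sample.most_common():
    print(name, n)
```

Running the same function over all 2043 filenames would give the per-category totals; dividing each by the column's total character count (96671) gives the percentage column.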
Most frequent character per category
Lowercase Letter
Value | Count | Frequency (%) |
t | 9700 | 14.9% |
e | 6951 | 10.7% |
n | 5853 | 9.0% |
a | 5710 | 8.8% |
r | 4755 | 7.3% |
i | 3922 | 6.0% |
o | 3477 | 5.3% |
s | 3433 | 5.3% |
p | 3097 | 4.7% |
c | 2690 | 4.1% |
Other values (14) | 15616 | 23.9% |
Uppercase Letter
Value | Count | Frequency (%) |
R | 1730 | 16.8% |
C | 1086 | 10.5% |
L | 985 | 9.5% |
P | 872 | 8.4% |
A | 858 | 8.3% |
O | 654 | 6.3% |
E | 549 | 5.3% |
M | 547 | 5.3% |
S | 532 | 5.2% |
T | 441 | 4.3% |
Other values (9) | 2073 | 20.1% |
Decimal Number
Value | Count | Frequency (%) |
2 | 1515 | 27.1% |
1 | 851 | 15.2% |
0 | 761 | 13.6% |
6 | 636 | 11.4% |
3 | 535 | 9.6% |
4 | 437 | 7.8% |
5 | 426 | 7.6% |
9 | 212 | 3.8% |
8 | 212 | 3.8% |
Other Punctuation
Value | Count | Frequency (%) |
. | 2477 | 85.1% |
, | 432 | 14.9% |
Connector Punctuation
Value | Count | Frequency (%) |
_ | 6664 | 100.0% |
Space Separator
Value | Count | Frequency (%) |
(space) | 3920 | 100.0% |
Dash Punctuation
Value | Count | Frequency (%) |
- | 2062 | 100.0% |
Most occurring scripts
Value | Count | Frequency (%) |
Latin | 75531 | 78.1% |
Common | 21140 | 21.9% |
Most frequent character per script
Latin
Value | Count | Frequency (%) |
t | 9700 | 12.8% |
e | 6951 | 9.2% |
n | 5853 | 7.7% |
a | 5710 | 7.6% |
r | 4755 | 6.3% |
i | 3922 | 5.2% |
o | 3477 | 4.6% |
s | 3433 | 4.5% |
p | 3097 | 4.1% |
c | 2690 | 3.6% |
Other values (33) | 25943 | 34.3% |
Common
Value | Count | Frequency (%) |
_ | 6664 | 31.5% |
(space) | 3920 | 18.5% |
. | 2477 | 11.7% |
- | 2062 | 9.8% |
2 | 1515 | 7.2% |
1 | 851 | 4.0% |
0 | 761 | 3.6% |
6 | 636 | 3.0% |
3 | 535 | 2.5% |
4 | 437 | 2.1% |
Other values (4) | 1282 | 6.1% |
Most occurring blocks
Value | Count | Frequency (%) |
ASCII | 96671 | 100.0% |
Most frequent character per block
ASCII
Value | Count | Frequency (%) |
t | 9700 | 10.0% |
e | 6951 | 7.2% |
_ | 6664 | 6.9% |
n | 5853 | 6.1% |
a | 5710 | 5.9% |
r | 4755 | 4.9% |
i | 3922 | 4.1% |
(space) | 3920 | 4.1% |
o | 3477 | 3.6% |
s | 3433 | 3.6% |
Other values (47) | 42286 | 43.7% |
source_doc_id
Categorical
Distinct | 19 |
---|---|
Distinct (%) | 0.9% |
Missing | 0 |
Missing (%) | 0.0% |
Memory size | 137.8 KiB |
7a72cd85-984 | 113 |
---|---|
66f03e4f-bd9 | 111 |
67f6cc9a-83c | 110 |
110b05be-f8d | 110 |
04a90337-527 | 109 |
Other values (14) | 1490 |
Length
Max length | 12 |
---|---|
Median length | 12 |
Mean length | 12 |
Min length | 12 |
Characters and Unicode
Total characters | 24516 |
---|---|
Distinct characters | 17 |
Distinct categories | 3 ? |
Distinct scripts | 2 ? |
Distinct blocks | 1 ? |
Unique
Unique | 0 ? |
---|---|
Unique (%) | 0.0% |
Sample
1st row | fed834b5-a04 |
---|---|
2nd row | aa279e3b-2d1 |
3rd row | 5e311e20-4bb |
4th row | 016e8d29-288 |
5th row | 07af2cf9-15a |
Common Values
Value | Count | Frequency (%) |
7a72cd85-984 | 113 | 5.5% |
66f03e4f-bd9 | 111 | 5.4% |
67f6cc9a-83c | 110 | 5.4% |
110b05be-f8d | 110 | 5.4% |
04a90337-527 | 109 | 5.3% |
65105d7b-502 | 109 | 5.3% |
adc6e224-1ea | 109 | 5.3% |
0abeb1f8-b6c | 109 | 5.3% |
af2b1960-5ca | 109 | 5.3% |
07af2cf9-15a | 108 | 5.3% |
Other values (9) | 946 | 46.3% |
Length
Value | Count | Frequency (%) |
7a72cd85-984 | 113 | 5.5% |
66f03e4f-bd9 | 111 | 5.4% |
67f6cc9a-83c | 110 | 5.4% |
110b05be-f8d | 110 | 5.4% |
04a90337-527 | 109 | 5.3% |
65105d7b-502 | 109 | 5.3% |
adc6e224-1ea | 109 | 5.3% |
0abeb1f8-b6c | 109 | 5.3% |
af2b1960-5ca | 109 | 5.3% |
3210a55b-6fd | 108 | 5.3% |
Other values (9) | 946 | 46.3% |
Most occurring characters
Value | Count | Frequency (%) |
- | 2043 | 8.3% |
a | 1928 | 7.9% |
e | 1913 | 7.8% |
d | 1700 | 6.9% |
2 | 1615 | 6.6% |
0 | 1518 | 6.2% |
b | 1515 | 6.2% |
1 | 1403 | 5.7% |
6 | 1400 | 5.7% |
5 | 1304 | 5.3% |
Other values (7) | 8177 | 33.4% |
Most occurring categories
Value | Count | Frequency (%) |
Decimal Number | 13036 | 53.2% |
Lowercase Letter | 9437 | 38.5% |
Dash Punctuation | 2043 | 8.3% |
Most frequent character per category
Decimal Number
Value | Count | Frequency (%) |
2 | 1615 | 12.4% |
0 | 1518 | 11.6% |
1 | 1403 | 10.8% |
6 | 1400 | 10.7% |
5 | 1304 | 10.0% |
8 | 1278 | 9.8% |
4 | 1278 | 9.8% |
9 | 1181 | 9.1% |
3 | 1076 | 8.3% |
7 | 983 | 7.5% |
Lowercase Letter
Value | Count | Frequency (%) |
a | 1928 | 20.4% |
e | 1913 | 20.3% |
d | 1700 | 18.0% |
b | 1515 | 16.1% |
f | 1299 | 13.8% |
c | 1082 | 11.5% |
Dash Punctuation
Value | Count | Frequency (%) |
- | 2043 | 100.0% |
Most occurring scripts
Value | Count | Frequency (%) |
Common | 15079 | 61.5% |
Latin | 9437 | 38.5% |
Most frequent character per script
Common
Value | Count | Frequency (%) |
- | 2043 | 13.5% |
2 | 1615 | 10.7% |
0 | 1518 | 10.1% |
1 | 1403 | 9.3% |
6 | 1400 | 9.3% |
5 | 1304 | 8.6% |
8 | 1278 | 8.5% |
4 | 1278 | 8.5% |
9 | 1181 | 7.8% |
3 | 1076 | 7.1% |
Latin
Value | Count | Frequency (%) |
a | 1928 | 20.4% |
e | 1913 | 20.3% |
d | 1700 | 18.0% |
b | 1515 | 16.1% |
f | 1299 | 13.8% |
c | 1082 | 11.5% |
Most occurring blocks
Value | Count | Frequency (%) |
ASCII | 24516 | 100.0% |
Most frequent character per block
ASCII
Value | Count | Frequency (%) |
- | 2043 | 8.3% |
a | 1928 | 7.9% |
e | 1913 | 7.8% |
d | 1700 | 6.9% |
2 | 1615 | 6.6% |
0 | 1518 | 6.2% |
b | 1515 | 6.2% |
1 | 1403 | 5.7% |
6 | 1400 | 5.7% |
5 | 1304 | 5.3% |
Other values (7) | 8177 | 33.4% |
source_doc_domain
Categorical
Distinct | 9 |
---|---|
Distinct (%) | 0.4% |
Missing | 0 |
Missing (%) | 0.0% |
Memory size | 131.2 KiB |
Script | 428 |
---|---|
OCR | 328 |
OCR_academic_paper | 326 |
ASR | 320 |
ASR_cleaned | 212 |
Other values (4) | 429 |
Length
Max length | 18 |
---|---|
Median length | 14 |
Mean length | 8.6975037 |
Min length | 3 |
Characters and Unicode
Total characters | 17769 |
---|---|
Distinct characters | 21 |
Distinct categories | 3 ? |
Distinct scripts | 2 ? |
Distinct blocks | 1 ? |
Unique
Unique | 0 ? |
---|---|
Unique (%) | 0.0% |
Sample
1st row | ASR_cleaned |
---|---|
2nd row | ASR_cleaned |
3rd row | ASR |
4th row | ASR |
5th row | ASR |
Common Values
Value | Count | Frequency (%) |
Script | 428 | 20.9% |
OCR | 328 | 16.1% |
OCR_academic_paper | 326 | 16.0% |
ASR | 320 | 15.7% |
ASR_cleaned | 212 | 10.4% |
academic_paper | 113 | 5.5% |
literature | 109 | 5.3% |
conversation | 108 | 5.3% |
adversarial | 99 | 4.8% |
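The Frequency (%) column in the table above is each count divided by the column total (2043 observations, none missing for this variable). The following is a minimal sketch of that arithmetic, using the counts listed in the table. `frequency_column` is an illustrative helper name, and rounding to one decimal place is an assumption inferred from the values the report prints.

```python
def frequency_column(counts):
    """Return each value's share of the total as a one-decimal percentage."""
    total = sum(counts.values())
    return {value: f"{n / total:.1%}" for value, n in counts.items()}


# Counts copied from the Common Values table for source_doc_domain.
domain_counts = {
    "Script": 428,
    "OCR": 328,
    "OCR_academic_paper": 326,
    "ASR": 320,
    "ASR_cleaned": 212,
    "academic_paper": 113,
    "literature": 109,
    "conversation": 108,
    "adversarial": 99,
}

shares = frequency_column(domain_counts)
for value, share in shares.items():
    print(f"{value} | {domain_counts[value]} | {share} |")
```

The same function applies to the per-category and per-script character tables, where the denominator is the category or script total rather than the number of observations.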
Length
Common Values (Plot)
Value | Count | Frequency (%) |
script | 428 | 20.9% |
ocr | 328 | 16.1% |
ocr_academic_paper | 326 | 16.0% |
asr | 320 | 15.7% |
asr_cleaned | 212 | 10.4% |
academic_paper | 113 | 5.5% |
literature | 109 | 5.3% |
conversation | 108 | 5.3% |
adversarial | 99 | 4.8% |
Most occurring characters
Value | Count | Frequency (%) |
a | 2043 | 11.5% |
e | 1727 | 9.7% |
c | 1626 | 9.2% |
r | 1391 | 7.8% |
p | 1306 | 7.3% |
R | 1186 | 6.7% |
i | 1183 | 6.7% |
_ | 977 | 5.5% |
S | 960 | 5.4% |
t | 754 | 4.2% |
Other values (11) | 4616 | 26.0% |
Most occurring categories
Value | Count | Frequency (%) |
Lowercase Letter | 12806 | 72.1% |
Uppercase Letter | 3986 | 22.4% |
Connector Punctuation | 977 | 5.5% |
Most frequent character per category
Lowercase Letter
Value | Count | Frequency (%) |
a | 2043 | 16.0% |
e | 1727 | 13.5% |
c | 1626 | 12.7% |
r | 1391 | 10.9% |
p | 1306 | 10.2% |
i | 1183 | 9.2% |
t | 754 | 5.9% |
d | 750 | 5.9% |
m | 439 | 3.4% |
n | 428 | 3.3% |
Other values (5) | 1159 | 9.1% |
Uppercase Letter
Value | Count | Frequency (%) |
R | 1186 | 29.8% |
S | 960 | 24.1% |
C | 654 | 16.4% |
O | 654 | 16.4% |
A | 532 | 13.3% |
Connector Punctuation
Value | Count | Frequency (%) |
_ | 977 | 100.0% |
Most occurring scripts
Value | Count | Frequency (%) |
Latin | 16792 | 94.5% |
Common | 977 | 5.5% |
Most frequent character per script
Latin
Value | Count | Frequency (%) |
a | 2043 | 12.2% |
e | 1727 | 10.3% |
c | 1626 | 9.7% |
r | 1391 | 8.3% |
p | 1306 | 7.8% |
R | 1186 | 7.1% |
i | 1183 | 7.0% |
S | 960 | 5.7% |
t | 754 | 4.5% |
d | 750 | 4.5% |
Other values (10) | 3866 | 23.0% |
Common
Value | Count | Frequency (%) |
_ | 977 | 100.0% |
Most occurring blocks
Value | Count | Frequency (%) |
ASCII | 17769 | 100.0% |
Most frequent character per block
ASCII
Value | Count | Frequency (%) |
a | 2043 | 11.5% |
e | 1727 | 9.7% |
c | 1626 | 9.2% |
r | 1391 | 7.8% |
p | 1306 | 7.3% |
R | 1186 | 6.7% |
i | 1183 | 6.7% |
_ | 977 | 5.5% |
S | 960 | 5.4% |
t | 754 | 4.2% |
Other values (11) | 4616 | 26.0% |
(index) | GAUNTLET_PATH | file_name | summary | min_length | max_length | no_repeat_ngram_size | encoder_no_repeat_ngram_size | repetition_penalty | num_beams | num_beam_groups | length_penalty | early_stopping | do_sample | model_name | date | length | format | extractiveness | temperature | token_batch_length | penalty_alpha | top_k | batch_stride | max_len_ratio | directory-topic-tag | runtime | source_doc_filename | source_doc_id | source_doc_domain |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | SHORT-CONTEXT-MODELS/flan-t5-3b-summarizer/beam-search-8192-nb4/ASR-whisper-rpunctuated_Noam Chomsky, Fundam_1669853561_0_part1_summary.txt | ASR-whisper-rpunctuated_Noam Chomsky, Fundam_1669853561_0_part1_summary.txt | There's lots of interesting things to say about language, but I don't think it's as simple as you think.\nThere's lots of interesting things about language, but if you really want to understand it, you need to look at the whole picture.\n\n--- | 8.0 | 2048.0 | 3.0 | 4.0 | 2.5 | 4.0 | 1.0 | 0.8 | True | False | jordiclive/flan-t5-3b-summarizer | 20230524_080731 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | ASR-whisper-rpunctuated_Noam Chomsky, Fundam_1669853561_0_part1.txt | fed834b5-a04 | ASR_cleaned |
1 | SHORT-CONTEXT-MODELS/flan-t5-3b-summarizer/beam-search-8192-nb4/ASR-whisper-rpunctuated_Noam Chomsky, Fundam_1669853631_0_part2_summary.txt | ASR-whisper-rpunctuated_Noam Chomsky, Fundam_1669853631_0_part2_summary.txt | There's no such thing as a simple language.\nThere's more than one way to solve ATB.\nI think you're asking the wrong question.\n\n--- | 8.0 | 2048.0 | 3.0 | 4.0 | 2.5 | 4.0 | 1.0 | 0.8 | True | False | jordiclive/flan-t5-3b-summarizer | 20230524_080731 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | ASR-whisper-rpunctuated_Noam Chomsky, Fundam_1669853631_0_part2.txt | aa279e3b-2d1 | ASR_cleaned |
2 | SHORT-CONTEXT-MODELS/flan-t5-3b-summarizer/beam-search-8192-nb4/ASRnlp_law_lecture_week_1_v_2_c_transcription_1_summary.txt | ASRnlp_law_lecture_week_1_v_2_c_transcription_1_summary.txt | if you don't want to read the whole thing, just skip it.\nI'm sorry for the wall of text.\n\n--- | 8.0 | 2048.0 | 3.0 | 4.0 | 2.5 | 4.0 | 1.0 | 0.8 | True | False | jordiclive/flan-t5-3b-summarizer | 20230524_080731 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | ASRnlp_law_lecture_week_1_v_2_c_transcription_1.txt | 5e311e20-4bb | ASR |
3 | SHORT-CONTEXT-MODELS/flan-t5-3b-summarizer/beam-search-8192-nb4/ASRnlp_law_lecture_week_2_v_2_c_transcription_2_summary.txt | ASRnlp_law_lecture_week_2_v_2_c_transcription_2_summary.txt | I think it's okay to ask questions about what you want to do in the course.\nI'm a bit of a nerd.\n\n--- | 8.0 | 2048.0 | 3.0 | 4.0 | 2.5 | 4.0 | 1.0 | 0.8 | True | False | jordiclive/flan-t5-3b-summarizer | 20230524_080731 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | ASRnlp_law_lecture_week_2_v_2_c_transcription_2.txt | 016e8d29-288 | ASR |
4 | SHORT-CONTEXT-MODELS/flan-t5-3b-summarizer/beam-search-8192-nb4/ASRnlp_law_lecture_week_3_part_1_v_2_c_transcription_3_summary.txt | ASRnlp_law_lecture_week_3_part_1_v_2_c_transcription_3_summary.txt | I'm not sure if this is the right subreddit to post this.\nI'm going to finish it in person.\n\n--- | 8.0 | 2048.0 | 3.0 | 4.0 | 2.5 | 4.0 | 1.0 | 0.8 | True | False | jordiclive/flan-t5-3b-summarizer | 20230524_080731 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | ASRnlp_law_lecture_week_3_part_1_v_2_c_transcription_3.txt | 07af2cf9-15a | ASR |
5 | SHORT-CONTEXT-MODELS/flan-t5-3b-summarizer/beam-search-8192-nb4/Emie_dissertation_cleansed_summary.txt | Emie_dissertation_cleansed_summary.txt | I'm writing a dissertation on Act of Violence (Fred Zinnemann), The Man Between (Claudie Reed), and the Theory of Film (Walter Kracauer).\nFilm noir is a genre of film that seeks to capture the material world as it emerges from its historical and cultural contexts.\nThe Man Between and Act of Violence are both films by the German-born, British-born filmmaker.\n\n--- | 8.0 | 2048.0 | 3.0 | 4.0 | 2.5 | 4.0 | 1.0 | 0.8 | True | False | jordiclive/flan-t5-3b-summarizer | 20230524_080731 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | Emie_dissertation_cleansed.txt | 7a72cd85-984 | academic_paper |
6 | SHORT-CONTEXT-MODELS/flan-t5-3b-summarizer/beam-search-8192-nb4/OCR_ML4HLecture02image__summary.txt | OCR_ML4HLecture02image__summary.txt | Ezurich's work on medical image analysis.\n\n--- | 8.0 | 2048.0 | 3.0 | 4.0 | 2.5 | 4.0 | 1.0 | 0.8 | True | False | jordiclive/flan-t5-3b-summarizer | 20230524_080731 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | OCR_ML4HLecture02image_.txt | 67f6cc9a-83c | OCR |
7 | SHORT-CONTEXT-MODELS/flan-t5-3b-summarizer/beam-search-8192-nb4/OCR_ML4HLecture04RepresentationLearning.pptx__summary.txt | OCR_ML4HLecture04RepresentationLearning.pptx__summary.txt | Unsupervised representation learning on medical time series\nWe propose a novel framework for learning representations from time series and apply it to health state data.\n\n--- | 8.0 | 2048.0 | 3.0 | 4.0 | 2.5 | 4.0 | 1.0 | 0.8 | True | False | jordiclive/flan-t5-3b-summarizer | 20230524_080731 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | OCR_ML4HLecture04RepresentationLearning.pptx_.txt | 65105d7b-502 | OCR |
8 | SHORT-CONTEXT-MODELS/flan-t5-3b-summarizer/beam-search-8192-nb4/OCR_ML4HLecture05-NLP.pptx__summary.txt | OCR_ML4HLecture05-NLP.pptx__summary.txt | We use a combination of HMMs, neural nets, and other methods to find the most probable sequence of words in a text.\nEzurich is a language model that computes the probabilistic representation of w_1, W_n for any word, _W_n=V (vocalbulary) for any sentence.\n\n--- | 8.0 | 2048.0 | 3.0 | 4.0 | 2.5 | 4.0 | 1.0 | 0.8 | True | False | jordiclive/flan-t5-3b-summarizer | 20230524_080731 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | OCR_ML4HLecture05-NLP.pptx_.txt | adc6e224-1ea | OCR |
9 | SHORT-CONTEXT-MODELS/flan-t5-3b-summarizer/beam-search-8192-nb4/OCR_PAPER_Hong et al. - 2022 - CogVideo Large-scale Pretraining for Text-to-Video Generation via Transformers-annotated__summary.txt | OCR_PAPER_Hong et al. - 2022 - CogVideo Large-scale Pretraining for Text-to-Video Generation via Transformers-annotated__summary.txt | We propose CogVideo to be the largest and first open source pretrained Transformer for Text-To-Video generation in general.\nWe introduce a human evaluation for CogVideo and show the results.\n\n--- | 8.0 | 2048.0 | 3.0 | 4.0 | 2.5 | 4.0 | 1.0 | 0.8 | True | False | jordiclive/flan-t5-3b-summarizer | 20230524_080731 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | OCR_PAPER_Hong et al. - 2022 - CogVideo Large-scale Pretraining for Text-to-Video Generation via Transformers-annotated_.txt | 66f03e4f-bd9 | OCR_academic_paper |
(index) | GAUNTLET_PATH | file_name | summary | min_length | max_length | no_repeat_ngram_size | encoder_no_repeat_ngram_size | repetition_penalty | num_beams | num_beam_groups | length_penalty | early_stopping | do_sample | model_name | date | length | format | extractiveness | temperature | token_batch_length | penalty_alpha | top_k | batch_stride | max_len_ratio | directory-topic-tag | runtime | source_doc_filename | source_doc_id | source_doc_domain |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
2033 | OpenAI/gpt-3.5-turbo/map_reduce_output/batched-summaries/OCR_PAPER_Hong et al. - 2022 - CogVideo Large-scale Pretraining for Text-to-Video Generation via Transformers-annotated__summary.txt | OCR_PAPER_Hong et al. - 2022 - CogVideo Large-scale Pretraining for Text-to-Video Generation via Transformers-annotated__summary.txt | Researchers from Tsinghua University have developed Cog Video, a large-scale pre-trained transformer for text-to-video generation. The model, which has 9.4 billion parameters and was trained on 5.4 million text-video pairs, outperforms all publicly available models in both machine and human evaluations. The team used a multi-frame-rate hierarchical training strategy to better align text and video clips, and inherited knowledge from a pre-trained text-to-image model. Cog Video is the largest and first open-source pre-trained transformer for text-to-video generation in the general domain.\n\tThe paper proposes a dual-channel attention technique for pretraining a text-to-video generation model using pretrained image generation models instead of image data. The proposed technique leverages the pretrained models' knowledge of text-image relations and larger dataset coverage. The paper also introduces a shifted window attention for auto-regressive generation to alleviate time and memory overhead. The model is evaluated on UCF-101 and Kinetics-600 datasets using Frechet Video Distance and Inception Score metrics and achieves better results than other baselines. Human evaluation also shows that the proposed model outperforms other baselines on multiple aspects.\n\tThe paper presents Cog' Video, a pretrained transformer for text-to-video generation in the general domain. The proposed multi-frame-rate hierarchical training framework improves the understanding of text-video relations and the ability to control the intensity of changes during generation. 
The paper also conducts ablation studies on Kinetics-600 and UCF-101 datasets to verify the effectiveness of hierarchical multi-frame-rate generation and incorporating Cog View2. The results show that the hierarchical method outperforms the 1-stage model on semantic relevance, motion realism, and texture quality. The paper aims to advance open-domain text-to-video generation, which will ease the effort of short video and digital art creation.\n\tThe article discusses the attention mechanism of dual-channel attention in the Cog Video model, which consists of two stages for sequential generation and recursive interpolation. The model is trained on a dataset of 5.4 million captioned videos and has 9.4 billion parameters. The article also provides details about the human evaluation process used to measure generation quality, which includes asking evaluators to give scores for frame texture, motion realism, and semantic relevance. The results of the evaluation show that Cog Video outperforms other models in terms of quality. | NaN | 512.0 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | gpt-3.5-turbo | NaN | NaN | NaN | NaN | NaN | 3584.0 | NaN | NaN | 0.0 | NaN | NaN | NaN | OCR_PAPER_Hong et al. - 2022 - CogVideo Large-scale Pretraining for Text-to-Video Generation via Transformers-annotated_.txt | 66f03e4f-bd9 | OCR_academic_paper |
2034 | OpenAI/gpt-3.5-turbo/map_reduce_output/batched-summaries/OCR_PAPER_Kandpal, Nieto, Jin - 2022 - Music Enhancement via Image Translation and Vocoding-annotated__summary.txt | OCR_PAPER_Kandpal, Nieto, Jin - 2022 - Music Enhancement via Image Translation and Vocoding-annotated__summary.txt | The paper presents a deep learning approach to enhance low-quality music recordings by combining an image-to-image translation model for manipulating audio in its mel-spectrogram representation and a music vocoding model for mapping synthetically generated mel-spectrograms to perceptually realistic waveforms. The approach outperforms baselines which use classical methods for mel-spectrogram inversion and an end-to-end approach directly mapping noisy waveforms to clean waveforms. The paper also analyzes the reliability of common audio enhancement evaluation metrics when used in the music domain. The authors hope to motivate future research in music enhancement and music quality perceptual metrics akin to those in the speech literature.\n\tThe paper proposes a music enhancement model that decomposes the task into mel-spectrogram enhancement and waveform synthesis from mel-spectrograms. The model was trained using high-quality samples from a public dataset paired with low-quality samples generated by simulating artifacts that typically appear in amateur recordings. A human MOS test shows that this model outperforms state-of-the-art baselines. Additionally, the paper finds that current objective metrics for audio enhancement do not accurately reflect human perception of music.\n\tThe references cited in this document cover various topics related to audio processing, including waveform synthesis, music source separation, speech enhancement, and noise suppression. The references include studies on generative adversarial networks, deep learning models, and objective quality measures for evaluating audio processing algorithms. 
Techniques such as instance normalization, parallel wavegan, and conditional generative adversarial networks are also discussed. | NaN | 512.0 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | gpt-3.5-turbo | NaN | NaN | NaN | NaN | NaN | 3584.0 | NaN | NaN | 0.0 | NaN | NaN | NaN | OCR_PAPER_Kandpal, Nieto, Jin - 2022 - Music Enhancement via Image Translation and Vocoding-annotated_.txt | 110b05be-f8d | OCR_academic_paper |
2035 | OpenAI/gpt-3.5-turbo/map_reduce_output/batched-summaries/OCR_PAPER_dall-e-2-annotated__summary.txt | OCR_PAPER_dall-e-2-annotated__summary.txt | The paper proposes a two-stage model for text-conditional image generation using CLIP embeddings. The first stage generates a CLIP image embedding given a text caption, and the second stage generates an image conditioned on the image embedding. The model can produce variations of an image that preserve both its semantics and style, while varying non-essential details. The joint embedding space of CLIP enables language-guided image manipulations in a zero-shot fashion. The model uses diffusion models for the decoder and experiments with both autoregressive and diffusion models for the prior, finding that the latter are computationally more efficient and produce higher-quality samples. The paper also describes three different kinds of manipulations enabled by the bipartite representation of images.\n\tThe article discusses a new text-to-image generation model called unCLIP, which uses a combination of a diffusion prior and a decoder to generate realistic images from text prompts. The model is evaluated on various benchmarks, including MS-COCO, and is found to outperform other state-of-the-art models in terms of diversity and photorealism. The article also explores the CLIP latent space and the importance of the prior in generating high-quality images. Finally, the article presents automated aesthetic quality evaluations comparing unCLIP to other models.\n\tThe paper discusses the use of CLIP-guided diffusion models for text-conditional image generation. The authors compare their model, unCLIP, to the previously proposed GLIDE model and find that both benefit from guidance, but unCLIP does not sacrifice recall for aesthetic quality. The paper also discusses previous works in synthetic image generation and the limitations and risks associated with these models. 
The authors acknowledge the need for further research on the risks and biases associated with these models.\n\tThe article provides a list of research papers related to text-to-image generation using deep learning techniques. The papers cover various approaches such as generative adversarial networks, diffusion models, and CLIP-guided models. The papers also explore different aspects of the problem, including domain adaptation, multimodal learning, and transfer learning. The article highlights the importance of contrastive learning and attention mechanisms in improving the quality of generated images.\n\tThe article describes the use of linear probes and logistic regression models to automate aesthetic quality evaluations of images. The models were trained on the AVA dataset and pairwise image comparisons gathered from previous human evaluations. The article also provides details on the hyperparameters used to train the models, including the use of CLIP and DALL-E datasets, and the GLIDE model for the decoder architecture. Random samples from the production model for various prompts are also shown. | NaN | 512.0 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | gpt-3.5-turbo | NaN | NaN | NaN | NaN | NaN | 3584.0 | NaN | NaN | 0.0 | NaN | NaN | NaN | OCR_PAPER_dall-e-2-annotated_.txt | 3f42d484-d96 | OCR_academic_paper |
2036 | OpenAI/gpt-3.5-turbo/map_reduce_output/batched-summaries/The Most Dangerous Game--Richard Connell_summary.txt | The Most Dangerous Game--Richard Connell_summary.txt | "The Most Dangerous Game" by Richard Connell is a story about a big-game hunter named Sanger Rainsford who falls off a yacht and ends up on a mysterious island called Ship-Trap Island. The island has a reputation for being dangerous, and Rainsford soon discovers why when he meets General Zaroff, a fellow hunter who has a twisted idea of what constitutes a good hunt. Zaroff hunts humans for sport, and he has set his sights on Rainsford as his next prey. Rainsford must use all his skills as a hunter to survive the deadly game that Zaroff has set up for him.\n\tGeneral Zaroff, a passionate hunter, invites Rainsford to his island where he reveals that he has invented a new sensation in hunting. He hunts humans, whom he considers the most dangerous game, and has a training school in his cellar for his prey. Rainsford is horrified and refuses to participate, but the general insists that life is for the strong and that weak men are put on earth to give the strong pleasure. Rainsford hears a gunshot in the jungle and realizes that he may be the next prey.\n\tRainsford, a big-game hunter, finds himself stranded on an island where he is hunted by General Zaroff, a fellow hunter who has grown bored with hunting animals and now hunts humans. Rainsford manages to evade the general for a while, but eventually, he is forced to face him in a deadly game of cat and mouse. In the end, Rainsford manages to outsmart the general and escape the island.\n\tRainsford escapes from General Zaroff's hunting game by jumping into the sea. The general enjoys a good dinner but is annoyed that Rainsford escaped and that he will have to replace his assistant Ivan. Later, Rainsford surprises the general in his bedroom and challenges him to a final hunt. The story ends with the implication that Rainsford has won the game. 
| NaN | 512.0 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | gpt-3.5-turbo | NaN | NaN | NaN | NaN | NaN | 3584.0 | NaN | NaN | 0.0 | NaN | NaN | NaN | The Most Dangerous Game--Richard Connell.txt | af2b1960-5ca | literature |
2037 | OpenAI/gpt-3.5-turbo/map_reduce_output/batched-summaries/gpt_peter_testing_group_exemplars_summary.txt | gpt_peter_testing_group_exemplars_summary.txt | The text contains a random assortment of questions, statements, and requests, ranging from discussions about Korea, fears, psychotic episodes, covert operations, and consciousness to more mundane topics like food, music, and hobbies. There are also some nonsensical or humorous comments and requests, such as fitting soybeans in foreskin, creating a cryptocurrency project, and asking about the funniest joke. The text lacks a clear theme or purpose.\n\tThe text is a collection of random and unrelated statements and questions, ranging from philosophical musings to personal anecdotes and recommendations for music and movies. There is no clear theme or narrative. | NaN | 512.0 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | gpt-3.5-turbo | NaN | NaN | NaN | NaN | NaN | 3584.0 | NaN | NaN | 0.0 | NaN | NaN | NaN | gpt_peter_testing_group_exemplars.txt | 3210a55b-6fd | conversation |
2038 | OpenAI/gpt-3.5-turbo/map_reduce_output/batched-summaries/navy seals copy pasta_summary.txt | navy seals copy pasta_summary.txt | A person threatens someone who insulted them online, claiming to be a highly trained Navy SEAL with access to a network of spies and the entire arsenal of the US Marine Corps. They vow to kill the person in over 700 ways and make them suffer for their comment. | NaN | 512.0 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | gpt-3.5-turbo | NaN | NaN | NaN | NaN | NaN | 3584.0 | NaN | NaN | 0.0 | NaN | NaN | NaN | navy seals copy pasta.txt | 6adec8a8-d94 | adversarial |
2039 | OpenAI/gpt-3.5-turbo/map_reduce_output/batched-summaries/script_findingnemo_summary.txt | script_findingnemo_summary.txt | This is a work-in-progress transcript of the movie Finding Nemo. It is not 100% accurate and may have missing or incorrect words. The transcript is open for corrections and additions, but cannot be edited and credited by others. The transcript starts with Marlin and Coral admiring their new home and discussing their upcoming parenthood. The story then follows Nemo's first day of school and his adventures with his classmates. The transcript ends with the group encountering a "butt" and making Pearl ink.\n\tMarlin, a clownfish, becomes overprotective of his son Nemo after his wife and other children are killed in a barracuda attack. When Nemo is captured by a diver, Marlin sets out to rescue him, encountering a forgetful fish named Dory and a group of sharks along the way. Meanwhile, Nemo is taken to a dentist's office in Sydney, where he meets other aquarium fish and plans his escape.\n\tNemo, a young clownfish, is taken from the ocean and placed in a fish tank in a dentist's office. He meets a group of fish who plan to escape the tank and return to the ocean. They recruit Nemo to help them by jamming the tank's filter. The plan is successful, and the fish escape into the harbor.\n\tNemo, a young fish, is encouraged by his new friends to escape from a fish tank in a dentist's office and find his way back to the ocean to reunite with his father. Meanwhile, Marlin, Nemo's father, is also on a journey to find his son and meets a forgetful fish named Dory who helps him along the way.\n\tThe fish characters panic as Nemo gets stuck in a filter, but they manage to rescue him. Meanwhile, Marlin and Dory ride the East Australian Current with the help of sea turtles and eventually reach Sydney. 
Crush, a sea turtle, gives them a proper exiting technique before they continue their journey.\n\tMarlin and Dory are searching for Marlin's son, Nemo, and end up inside a whale. They eventually escape and continue their search, while Nemo and his fish tank friends plan their escape from the dentist's office. Meanwhile, a pelican named Nigel and his friends observe the chaos.\n\tMarlin, a clownfish, sets out to find his son Nemo who has been taken by a diver. Along the way, he meets Dory, a forgetful fish, and together they encounter various obstacles and make new friends. Eventually, they find Nemo and bring him back home.\n\tThe transcript contains dialogue from the movie "Finding Nemo" where the characters say goodbye to each other and a scene where the fish in a dentist's office try to escape. The transcript is provided for fans' enjoyment and educational purposes only, and no copyright infringement is intended. | NaN | 512.0 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | gpt-3.5-turbo | NaN | NaN | NaN | NaN | NaN | 3584.0 | NaN | NaN | 0.0 | NaN | NaN | NaN | script_findingnemo.txt | 04a90337-527 | Script |
2040 | OpenAI/gpt-3.5-turbo/map_reduce_output/batched-summaries/script_frozendisney_summary.txt | script_frozendisney_summary.txt | The opening scene of "Frozen" shows ice harvesters singing and cutting ice blocks. The story then follows two sisters, Elsa and Anna, as they grow up in a kingdom where Elsa has magical powers. After an accident involving Anna, their parents decide to keep Elsa's powers hidden and limit her contact with people, including Anna. As they grow up, Anna tries to reconnect with Elsa, but Elsa struggles to control her powers and keep them hidden.\n\tAnna is bored and watches the clock tick by while Elsa panics about her growing powers. The King and Queen leave on a ship and are lost at sea, leaving Anna and Elsa alone. Years later, it's Coronation Day and Anna is excited while Elsa is nervous. Anna meets Prince Hans and they have an awkward encounter. The bells ring for the coronation.\n\tHans and Anna attend Elsa's coronation, but Elsa is nervous and hesitant. During the celebration, Anna and Hans fall in love and decide to get married, but Elsa refuses to give her blessing. In a heated argument, Elsa accidentally reveals her ice powers to everyone and runs away, leaving Anna heartbroken and confused.\n\tElsa accidentally reveals her powers at the ball, causing chaos and prompting her to flee. Anna sets out to find her and apologize, encountering Kristoff and Oaken's Trading Post along the way. Elsa reaches a mountain top and sings "Let It Go" as she creates an ice palace. Anna eventually reaches the trading post and learns that Elsa went to the North Mountain. Kristoff agrees to help her, but they get into a dispute with Oaken over the price of supplies.\n\tKristoff and Anna find shelter in a dilapidated barn and Kristoff sings a song to Sven. Anna asks Kristoff to take her up the North Mountain to find Elsa and stop the winter. They encounter wolves and Kristoff's sled is destroyed, but they manage to escape. 
They continue their journey on foot and meet Olaf, a talking snowman without a nose.\n\tAnna, Kristoff, Sven, and Olaf continue their journey to find Elsa and stop the eternal winter. Olaf gets impaled by an icicle but laughs it off. They reach Elsa's ice palace, and Sven struggles to climb the stairs. Kristoff helps him while Anna and Olaf climb the stairs.\n\tAnna and Kristoff arrive at Elsa's ice palace, where Anna tries to convince Elsa to return to Arendelle and end the eternal winter she has caused. However, Elsa is afraid of hurting anyone else with her powers and creates a giant snowman, Marshmallow, to throw them out. Anna and Kristoff escape, but Marshmallow chases them and they end up hanging off a cliff. Olaf tries to help but is thrown off the cliff by Marshmallow. Eventually, Anna cuts the rope and they fall into a soft snowbank.\n\tAnna and Kristoff, along with Olaf and Sven, arrive at Kristoff's family of trolls. The trolls mistake Anna for Kristoff's fiancée and sing a song about fixing up relationships. Anna and Kristoff start to feel a spark between them, but are interrupted by the trolls trying to marry them off.\n\tAnna collapses and is found to have ice in her heart, which can only be removed by an act of true love. Kristoff and Sven bring her back to the castle, where Hans pretends to be in love with her but reveals his true intentions to kill Elsa and take over the kingdom. Hans charges Elsa with treason and sentences her to death, while Anna's condition worsens.\n\tElsa escapes from her imprisonment and creates a snowstorm that engulfs the kingdom. Anna and Olaf search for her, while Kristoff and Sven try to reach Anna. Hans confronts Elsa, but Anna sacrifices herself to save Elsa and thaws her frozen heart. Elsa realizes that love is the key to controlling her powers and uses it to end the snowstorm. Hans is arrested and taken back to his country. 
The kingdom is restored to its former glory.\n\tThe Duke and his thugs are escorted out of Arendelle by guards, while Anna surprises Kristoff with a new sled and makes him the official Ice Master and Deliverer. Olaf enjoys the summer and Sven helps Elsa create an ice rink for the villagers to skate on. The castle has been repaired with ice and all is well in Arendelle. | NaN | 512.0 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | gpt-3.5-turbo | NaN | NaN | NaN | NaN | NaN | 3584.0 | NaN | NaN | 0.0 | NaN | NaN | NaN | script_frozendisney.txt | 0abeb1f8-b6c | Script |
2041 | OpenAI/gpt-3.5-turbo/map_reduce_output/batched-summaries/script_strangersonatrain_summary.txt | script_strangersonatrain_summary.txt | The script for "Strangers on a Train" by Raymond Chandler and Czenzi Ormonde begins with two strangers, Guy Haines and Bruno Anthony, meeting on a train. Bruno is fascinated by Guy, a famous tennis player, and strikes up a conversation with him. As they talk, Bruno reveals his dark thoughts about murder and his troubled relationship with his father. Guy becomes increasingly uncomfortable and tries to change the subject, but Bruno persists in his morbid musings.\n\tBruno suggests to Guy a plan to swap murders with a stranger to get rid of their respective unwanted targets. Guy is hesitant and tries to leave, but Bruno insists on discussing the plan further. Meanwhile, Guy's ex-wife Miriam refuses to give him a divorce and threatens to ruin his reputation by having another man's baby. Guy calls his lover Anne to vent his frustration, but his anger escalates as a train passes by, and he yells that he could strangle Miriam. The scene then shifts to Bruno and his mother getting manicures.\n\tMrs. Anthony, a wealthy woman, is concerned about her son Bruno's restlessness and pale appearance. She suggests he take up painting as a soothing pastime. Bruno receives a call from Guy, and his father confronts him about his involvement in hit and run driving. Bruno goes to Metcalf to stalk Miriam, and they end up at an amusement park where he impresses her by ringing the bell on a sledgehammer game. He follows her onto a merry-go-round.\n\tBruno meets Guy at an amusement park and follows him onto a boat ride with his friends. After they exit the ride, Bruno strangles and kills Miriam, a girl he had been stalking. Later, Bruno meets Guy again and gives him Miriam's broken glasses as a "present," revealing that he was the one who killed her. 
Guy is horrified and calls Bruno a maniac.\n\tGuy is confronted by Bruno, who reminds him that they planned a murder together. Guy tries to leave, but Bruno warns him that they would both be arrested if he goes to the police. Meanwhile, Guy's phone rings and the police arrive at his apartment building. Bruno urges Guy to tell the police that he already knows about the murder. Later, Guy receives the news that his wife has been murdered and he becomes a suspect. He tells the Senator that he was on a train at the time of the murder and spoke to a professor named Collins. Anne comforts Guy, and they realize that Miriam's death means they are now free to be together.\n\tGuy Haines is being investigated for the murder of his estranged wife, Miriam. He meets with the police to establish his alibi and is followed by a private detective named Hennessy. Guy's girlfriend, Anne, worries about the investigation and suggests he continue with his plans to play in a tennis tournament. Meanwhile, a man named Bruno, who has a strange fixation on Guy, follows him around and tries to contact him.\n\tGuy receives a note from Bruno asking to meet and make plans, but Guy tears it up and burns it. Later, at a gallery with Anne, Bruno appears and tries to talk to Guy, causing him to become nervous. At a tennis match, Guy sees Bruno watching him. Later, at a party, Barbara introduces Bruno to the group and he becomes fixated on her. Guy receives another note from Bruno, but hides it when Hennessy arrives.\n\tGuy and Hennessy discuss Hammond taking over, while Guy retrieves a note and gun from a dresser drawer. They leave for a party at the Burton house, where Bruno unexpectedly shows up. Bruno engages in a conversation about murder with some guests, including Mrs. Cunningham, and demonstrates how to strangle someone. Barbara watches in horror as Bruno becomes transfixed and eventually faints. 
Bruno is taken to a study, and the Senator asks Guy to get him out of there.\n\tGuy goes to Bruno's house to carry out their plan to exchange murders, but is interrupted by the arrival of the police. He manages to escape and goes to Mr. Antony's house to warn him about Bruno's intentions. Guy enters Mr. Antony's bedroom and wakes him up to talk about Bruno, but the scene ends before any further action is taken.\n\tBruno confronts Guy in his bedroom, revealing that he knows about the murder and threatening to frame Guy by planting evidence. Anne tries to convince Bruno's mother to help clear Guy's name, but she dismisses the idea. Guy and Anne discuss their plan to retrieve Guy's lighter from the murder scene before Bruno can plant it there. Meanwhile, Guy plays a tennis match while being watched by detectives Hennessy and Hammond. Bruno leaves his home in a taxi, presumably to carry out his plan.\n\tGuy Haines is playing a tennis match while his friend Bruno is on a train to Metcalf. Anne, Guy's lover, tells him that Bruno may implicate him in the murder of his wife. Guy is worried about his cigarette lighter being found at the scene of the crime. Meanwhile, Bruno reads about Guy's arrest in the newspaper and plays with the lighter. Guy wins the tennis match and Anne tells Barbara to have a car ready. Bruno arrives in Metcalf but drops the lighter when bumped by a passerby.\n\tBruno drops his cigarette case down a drain and enlists the help of a porter and passersby to retrieve it. Meanwhile, Guy is playing a tennis match and wins a crucial game. Barbara signals to Guy that everything is set for their plan, and he leaves the match to meet her. The police are on the lookout for Guy, and Bruno begins to feel uneasy as he overhears them talking about the killer being at the amusement park. Guy arrives at the park, and the police follow him.\n\tBruno is being followed and is seen approaching a flood-lit pay-box. The boatman recognizes him and Bruno deserts the queue. 
The boatman informs a uniformed man who starts looking for Bruno. Bruno jumps on a merry-go-round and Guy chases after him. They fight and the merry-go-round topples over. Guy is helped to his feet and the boatman informs the police that Guy is not the man who killed his wife. Guy explains that Bruno has his cigarette lighter and wanted to plant it on the island to frame him. They find Bruno pinned under the overturned machine and he denies having the lighter. He dies shortly after.\n\tAs Bruno dies, his hand opens to reveal Guy's lighter. Turley takes the lighter and suggests they stay in town overnight to clear things up. Guy asks for a telephone and learns that Bruno's name was Bruno Antony. Later, Anne receives a call from Guy saying he'll be back tomorrow. The next day, on a train, a cleric recognizes Guy and they quickly leave. The film ends. | NaN | 512.0 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | gpt-3.5-turbo | NaN | NaN | NaN | NaN | NaN | 3584.0 | NaN | NaN | 0.0 | NaN | NaN | NaN | script_strangersonatrain.txt | 9e6bfae4-7c2 | Script |
2042 | OpenAI/gpt-3.5-turbo/map_reduce_output/batched-summaries/script_sunsetblvd._summary.txt | script_sunsetblvd._summary.txt | The script for the movie "Sunset Boulevard" begins with a sequence showing the street sign for Sunset Boulevard and a murder scene at a mansion. The story follows Joe Gillis, a struggling writer who meets with a producer named Sheldrake to pitch his idea for a baseball movie. Sheldrake is not impressed, and Gillis meets Betty Schaefer, a script reader who also dislikes his idea. Gillis is desperate for work and hopes Sheldrake can help him, but nothing comes of it.\n\tJoe Gillis is desperate for work and money, but his attempts to secure either are unsuccessful. He even asks his boss for a personal loan, but is denied. While driving, he is chased by finance company men and ends up hiding in the garage of a run-down mansion. He is then led into the mansion by Max von Mayerling and meets Norma Desmond, a former silent film star who is eccentric and delusional. She mistakes him for a funeral director and asks him to arrange a funeral for her dead chimpanzee. Gillis tries to explain the mistake, but Norma is not convinced.\n\tJoe Gillis enters what he thinks is an empty house but is confronted by Norma Desmond, a former silent film star. Norma insists that Joe edit her script and stay in her house. Joe agrees and is shown to a room over the garage by Norma's butler, Max. Joe observes the dilapidated state of the house and its amenities, including a tennis court and swimming pool.\n\tJoe Gillis watches a rat fight over a decaying orange at the bottom of a swimming pool while Norma Desmond and Max bury a chimp in the lawn. Gillis locks himself in a room and wakes up to find his belongings unpacked and Norma insisting he stay to work on her script. They watch old movies together, and Norma dreams of returning to stardom. 
Gillis kibitzes on a bridge game with Norma and her actor friends while trying to avoid men who have come to tow away his car.\n\tGillis needs money urgently and asks Norma for it, but she refuses. He goes outside and sees the finance company taking away his car. Norma offers him her expensive Isotta-Fraschini car instead. Later, Gillis is dressed up for Norma's New Year's party and they dance together. Norma confesses her love for Gillis, making him uncomfortable.\n\tNorma offers to buy Gillis extravagant gifts for the upcoming year, but he refuses. She then gives him a gold cigarette case and lighter with a personal engraving. Gillis expresses his desire to have a life of his own, causing Norma to slap him and storm off. Gillis leaves the party and goes to Artie Green's apartment, where he meets Betty Schaefer. They discuss writing, but Gillis receives a call from Max, who tells him that Norma has attempted suicide. Gillis is in shock and pushes Betty aside to leave.\n\tJoe rushes to Norma's house to check on her after she attempted suicide. Norma is still in love with Joe, but he tries to convince her to act sensibly. Later, Betty tells Joe that Sheldrake likes the idea of his script, but Joe is not interested in writing anymore. Norma tries to cheer Joe up by performing a comedic routine, but he is still preoccupied with his thoughts about Betty and the Hollywood industry.\n\tNorma Desmond receives a call from Paramount Studios, but is upset that it was not from Cecil B. DeMille himself. She goes to the studio to meet with DeMille, who apologizes for not calling her personally. Norma becomes emotional and expresses her desire to work with DeMille again. Meanwhile, Joe Gillis visits the Readers' Department and offers Betty his script, Dark Windows.\n\tBetty and Gillis discuss a story idea about teachers and their struggles. Gillis suggests a romantic plot involving two teachers sharing a room. 
Betty and Gillis agree to work on the story together, but Gillis is hesitant due to his busy schedule. Norma Desmond undergoes various beauty treatments and expresses her dependence on Gillis. Gillis sneaks out to work on the story with Betty at night. They take a walk down Paramount's New York street and discuss their childhoods.\n\tJoe Gillis and Betty Schaefer discuss their past experiences in the film industry. Betty comes from a family of actors and had dreams of becoming a star, but was rejected due to her acting skills. Joe and Betty grow closer, but Norma Desmond, Joe's former lover, becomes increasingly unstable and calls Betty to warn her about Joe's true character. Betty visits Joe at his home, which is actually Norma's mansion, and they discuss their feelings for each other. Meanwhile, Norma becomes increasingly desperate and reveals a hidden revolver.\n\tJoe Gillis shows Betty around Norma Desmond's mansion, revealing that Norma is an aging former movie star who lives with a companion and is jealous of Betty. Betty becomes upset and wants to leave, but Joe convinces her to stay. Norma becomes increasingly unstable and shoots Joe when he tries to leave her. The police arrive and question Norma, but she becomes fixated on the newsreel cameras and believes she is going to be on set for a film. Max, Norma's loyal servant, helps her escape the police and get to the set.\n\tNorma Desmond prepares for a scene on a staircase while Max sets up the cameras and lights. Norma descends the staircase, stopping to express her happiness to be back in the studio and promises to never desert them again. She then requests her closeup and the scene fades out. | NaN | 512.0 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | gpt-3.5-turbo | NaN | NaN | NaN | NaN | NaN | 3584.0 | NaN | NaN | 0.0 | NaN | NaN | NaN | script_sunsetblvd..txt | deed3ee1-dae | Script |