WorldLex: Blog, Twitter and Newspapers Word Frequencies for 64 languages

Manuel Gimenes & Boris New

All word counts are expressed in millions of words. For instance, the Afrikaans corpus has a total of 14.3 million words.

These corpora have been collected by Hans Christensen for HCCorpora.
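As a quick illustration of the units, the sketch below converts the Afrikaans figures from millions of words to an approximate mean document length per source. The values are copied from the Afrikaans row of the table; the dictionary layout and the function name are illustrative assumptions, not part of WorldLex.

```python
# Figures copied from the Afrikaans row; all names here are illustrative.
MILLION = 1_000_000

afrikaans = {
    "blogs": (6.2, 181_051),    # (millions of words, number of documents)
    "twitter": (2.9, 219_559),
    "news": (5.2, 152_312),
}

def words_per_document(millions_of_words: float, n_docs: int) -> float:
    """Approximate mean document length in words."""
    return millions_of_words * MILLION / n_docs

# Summing the per-source word counts should reproduce the Total column.
total_millions = round(sum(words for words, _ in afrikaans.values()), 1)
print(f"total: {total_millions} million words")

for source, (words, docs) in afrikaans.items():
    print(f"{source}: about {words_per_document(words, docs):.0f} words per document")
```

Because the published word counts are rounded to one decimal place, the per-document figures are only approximate.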

Language

Collected Date

BlogsNbWords

BlogsNbDocs

TwitterNbWords

TwitterNbDocs

NewsNbWords

NewsNbDocs

Total

Corpus

Raw Frequencies

Spellchecked Frequencies

Afrikaans

2011

6.2

181,051

2.9

219,559

5.2

152,312

14.3

Corpus

Link

Link

Albanian

2012

11.8

327,840

1.3

102,386

13.1

228,553

26.2

Corpus

Link

Link

Amharic

2013

1.1

24,579

1.3

32,554

2.4

Corpus

Link

Arabic

2011-12

20.0

877,403

21.0

1,641,146

21.7

510,612

62.7

Corpus

Link

Link

Armenian

2011

6.0

243,097

8.5

156,586

14.6

Corpus

Link

Link

Azeri

2012

5.0

155,140

0.7

64,838

6.9

140,995

12.7

Corpus

Link

Link

Bengali

2012

3.0

105,696

2.9

58,998

5.8

Corpus

Link

Link

Bosnian

2013

5.6

170,333

6.0

181,370

11.7

Corpus

Link

Catalan

2013

8.2

187,262

6.5

397,410

5.0

81,893

19.7

Corpus

Link

Link

Chinese Simplified

2011

27.9

1,045,472

32.6

1,440,112

33.7

682,472

94.1

Corpus

Link

Croatian

2012

10.6

297,117

4.2

317,257

10.6

227,317

25.5

Corpus

Link

Link

Czech

2011

10.6

293,584

8.0

565,638

10.8

276,881

29.4

Corpus

Link

Link

Danish

2010-11

29.5

904,546

14.9

1,062,567

27.6

887,016

72.0

Corpus

Link

Link

Dutch

2011-12

25.6

761,163

21.8

1,671,690

13.9

313,508

61.4

Corpus

Link

Link

English US

2012

38.1

899,288

30.9

2,360,148

35.2

1,010,242

104.2

Corpus

Link

Link

Estonian

2011

13.1

409,501

4.4

388,541

11.9

422,432

29.4

Corpus

Link

Link

Finnish

2011

12.8

439,785

3.2

285,214

10.5

485,758

26.4

Corpus

Link

Link

French

2011-12

35.2

880,655

28.9

2,023,279

20.1

358,001

84.2

Corpus

Link

Link

Georgian

2011

4.7

181,499

4.8

164,614

9.5

Corpus

Link

German

2010-11

23.4

715,439

24.3

1,936,088

27.1

533,905

74.8

Corpus

Link

Link

Greek

2011-12

19.8

564,281

18.8

1,564,325

18.6

424,397

57.2

Corpus

Link

Link

Greenlandic

2012

3.7

227,073

3.7

Corpus

Link

Gujarati

2011

5.1

224,047

5.0

116,482

10.2

Corpus

Link

Link

Hebrew

2011

8.4

269,866

4.7

409,582

8.2

199,047

21.4

Corpus

Link

Link

Hindi

2011-12

6.7

280,267

7.0

134,268

13.7

Corpus

Link

Link

Hungarian

2011-13

23.3

822,669

19.8

1,819,217

23.8

548,938

66.9

Corpus

Link

Link

Icelandic

2011

8.1

234,021

2.8

230,651

5.7

144,018

16.7

Corpus

Link

Link

Indonesian

2011-12

37.9

1,645,328

39.0

3,449,770

38.3

1,144,596

115.2

Corpus

Link

Link

Italian

2011-12

26.2

839,919

23.9

1,985,519

29.5

394,465

79.5

Corpus

Link

Link

Japanese

2011

14.9

664,309

11.9

667,119

14.1

312,916

40.9

Corpus

Link

Kannada

2011

4.0

173,154

5.0

175,824

9.1

Corpus

Link

Kazakh

2012

2.1

77,940

3.5

77,799

5.6

Corpus

Link

Link

Khmer

2012

2.6

122,528

3.4

55,674

6.0

Corpus

Link

Korean

2011

17.6

923,997

18.7

1,572,766

19.4

667,314

55.7

Corpus

Link

Latvian

2012

12.5

374,913

11.3

942,301

12.4

319,428

36.3

Corpus

Link

Link

Lithuanian

2011

4.0

144,945

1.3

125,387

4.6

149,453

9.9

Corpus

Link

Link

Macedonian

2012

6.4

218,055

2.5

192,612

6.5

144,853

15.4

Corpus

Link

Link

Malayalam

2011

2.0

102,043

0.3

28,337

1.7

40,484

4.0

Corpus

Link

Link

Malaysian

2011

8.9

333,607

6.1

611,028

8.9

356,723

23.9

Corpus

Link

Mongolian

2012

4.8

156,390

5.2

108,846

10.0

Corpus

Link

Link

Nepali

2013

2.5

76,080

0.9

62,726

2.5

54,877

5.9

Corpus

Link

Link

Norwegian

2011

16.9

487,754

12.5

897,939

14.5

554,226

44.0

Corpus

Link

Link

Persian

2012

4.7

135,767

4.0

167,898

8.8

Corpus

Link

Link

Polish

2011-13

26.5

852,733

22.5

2,066,716

25.8

698,571

74.8

Corpus

Link

Link

Portuguese Brazil

2011

14.2

600,228

19.5

1,672,477

17.0

380,983

50.7

Corpus

Link

Link

Portuguese Europe

2011-12

21.5

788,683

22.3

1,799,560

24.2

606,037

68.0

Corpus

Link

Link

Punjabi

2012

14.8

372,073

12.3

940,256

15.1

306,846

42.2

Corpus

Link

Link

Romanian

2011-13

30.8

834,510

12.7

961,551

31.1

669,306

74.6

Corpus

Link

Link

Russian

2011-12

20.3

753,319

23.1

2,136,329

20.3

456,407

63.6

Corpus

Link

Link

Serbian (Latin)

2013

7.3

212,482

6.4

449,312

7.2

167,587

21.0

Corpus

Link

Link

Sinhala

2011

5.0

190,719

5.9

143,970

10.8

Corpus

Link

Link

Slovak

2011

11.2

277,600

1.4

103,163

10.0

245,660

22.7

Corpus

Link

Link

Slovenian

2012-13

14.1

342,459

6.6

517,244

14.8

255,793

35.5

Corpus

Link

Link

Spanish South America

2012

15.3

570,369

14.5

1,140,487

16.0

389,620

45.8

Corpus

Link

Link

Spanish Spain

2011-12

29.6

2,136,625

16.0

584,340

45.6

Corpus

Link

Link

Swahili

2012

5.3

170,168

1.1

123,601

7.0

228,011

13.4

Corpus

Link

Link

Swedish

2011-12

26.9

774,117

23.2

1,770,655

23.9

855,034

74.1

Corpus

Link

Link

Tagalog

2012

5.1

184,580

4.6

505,743

4.2

136,199

13.9

Corpus

Link

Tamil

2011-12

4.0

205,510

3.6

375,178

3.2

133,706

10.8

Corpus

Link

Telugu

2011-12

5.0

216,574

4.8

119,265

9.9

Corpus

Link

Link

Turkish

2011-12

22.0

914,741

20.6

1,924,915

21.1

697,728

63.7

Corpus

Link

Link

Ukrainian

2011

10.8

379,212

6.7

570,684

11.3

306,617

28.8

Corpus

Link

Link

Urdu

2012

3.3

76,439

0.6

89,109

3.9

74,878

7.7

Corpus

Link

Uzbek

2012

5.1

148,161

5.1

Corpus

Link

Vietnamese

2012

16.4

402,515

12.2

838,067

17.6

284,419

46.1

Corpus

Link

Link

Welsh

2013

2.0

41,092

1.8

63,602

3.8

Corpus

Link

Link
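The rows above can be read programmatically even in this one-cell-per-line form. The following is a minimal sketch, assuming each row starts with the language name and collection date, ends with a "Corpus" cell followed by one or two "Link" cells, and lists (word count, document count) pairs before the total. Some rows carry fewer than three sources, and this layout does not say which source is absent, so the sketch keeps the pairs in order without labelling them. All function names are hypothetical.

```python
def split_rows(lines):
    """Group one-cell-per-line text into rows; a new row starts after the last "Link" cell."""
    rows, current = [], []
    for cell in (line.strip() for line in lines):
        if not cell:
            continue  # skip the blank lines between cells
        if current and current[-1] == "Link" and cell != "Link":
            rows.append(current)
            current = []
        current.append(cell)
    if current:
        rows.append(current)
    return rows

def parse_row(cells):
    """Return (language, date, sources, total); sources are unlabelled (millions, docs) pairs."""
    language, date = cells[0], cells[1]
    end = cells.index("Corpus", 2)          # numeric cells stop at the "Corpus" cell
    *flat, total = cells[2:end]
    sources = [(float(w), int(d.replace(",", "")))
               for w, d in zip(flat[0::2], flat[1::2])]
    return language, date, sources, float(total)

def total_matches(sources, total, tolerance=0.2):
    """Check the Total column against the sum of the sources, allowing for rounding."""
    return abs(sum(w for w, _ in sources) - total) <= tolerance

# Three rows copied from the table, without the blank lines.
sample = """Afrikaans
2011
6.2
181,051
2.9
219,559
5.2
152,312
14.3
Corpus
Link
Link
Amharic
2013
1.1
24,579
1.3
32,554
2.4
Corpus
Link
Greenlandic
2012
3.7
227,073
3.7
Corpus
Link""".splitlines()

parsed = [parse_row(row) for row in split_rows(sample)]
for language, date, sources, total in parsed:
    print(language, date, len(sources), "source(s)", total, total_matches(sources, total))
```

The tolerance is there because every figure is rounded to one decimal place, so a sum of rounded sources can differ slightly from the rounded total (for example, the Armenian row lists 6.0 and 8.5 against a total of 14.6). The header cells before the first row would need to be skipped before calling `split_rows` on the full table.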