  ___  ____  ____  ____  ____ ®
 /__    /   ____/   /   ____/      Stata 18.0
___/   /   /___/   /   /___/       MP—Parallel Edition
Statistics and Data Science Copyright 1985-2023 StataCorp LLC
StataCorp
4905 Lakeway Drive
College Station, Texas 77845 USA
800-782-8272 https://www.stata.com
979-696-4600 service@stata.com
Stata license: Single-user 2-core , expiring 13 Feb 2026
Serial number: 501809366391
Licensed to: sam
NJAU
Notes:
1. Stata is running in batch mode.
2. Unicode is supported; see help unicode_advice.
3. More than 2 billion observations are allowed; see help obs_advice.
4. Maximum number of variables is set to 5,000 but can be increased;
see help set_maxvar.
. do word_freq_count.do
. set maxvar 120000
.
. * Define the keyword list
. global keywords "人工智能 机器学习 深度学习 神经网络 自然语言处理 计算机视觉
强化学习 大数据 算法 模型"
.
. * Get all text files in the current directory
. local files : dir . files "*.txt"
.
. * Create the results dataset
. clear
. set obs 1
Number of observations (_N) was 0, now 1.
. gen filename = ""
(1 missing value generated)
. foreach word in $keywords {
2. gen `word'_count = 0
3. }
.
. * Loop over each file
. local row = 1
. foreach file of local files {
2. * Read the file contents
. cap file close myfile
3. file open myfile using "`file'", read text
4. file read myfile line
5.
. local content = ""
6. while r(eof) == 0 {
7. local content = `"`content' `line'"'
8. file read myfile line
9. }
10. file close myfile
11.
. * Count the occurrences of each keyword
. set obs `row'
12. replace filename = "`file'" in `row'
13.
. foreach word in $keywords {
14. * Use a regular expression to count keyword occurrences
. local count = 0
15. local temp_content = `"`content'"'
16.
. while regexm(`"`temp_content'"', "`word'") {
17. local count = `count' + 1
18. local temp_content = regexr(`"`temp_content'"', "`word'", "")
19. }
20.
. replace `word'_count = `count' in `row'
21. }
22.
. local row = `row' + 1
23. }
Number of observations (_N) was 1, now 1.
variable filename was str1 now str22
(1 real change made)
(1 real change made)
(0 real changes made)
(0 real changes made)
(0 real changes made)
(0 real changes made)
(0 real changes made)
(0 real changes made)
(1 real change made)
(1 real change made)
(1 real change made)
Number of observations (_N) was 1, now 2.
(1 real change made)
(1 real change made)
(1 real change made)
(1 real change made)
(1 real change made)
(1 real change made)
(1 real change made)
(1 real change made)
(1 real change made)
(1 real change made)
(1 real change made)
.
. * Display the results
. list filename *_count
+--------------------------------------------------------------------+
|               filename | 人工智~t | 机器学~t | 深度学~t | 神经网~t |
| MinerU_601398_2021.txt |        4 |        0 |        0 |        0 |
|--------------------------------------------------------------------|
| 自然语~t | 计算机~t | 强化学~t | 大数据~t | 算法_c~t |    模型_c~t |
|        0 |        0 |        0 |        6 |        1 |          60 |
+--------------------------------------------------------------------+
+--------------------------------------------------------------------+
|               filename | 人工智~t | 机器学~t | 深度学~t | 神经网~t |
| MinerU_601398_2022.txt |        4 |        0 |        0 |        0 |
|--------------------------------------------------------------------|
| 自然语~t | 计算机~t | 强化学~t | 大数据~t | 算法_c~t |    模型_c~t |
|        0 |        0 |        0 |        9 |        0 |          63 |
+--------------------------------------------------------------------+
.
. * Optional: save the results to a CSV file
. export delimited using "keyword_counts.csv", replace
(file keyword_counts.csv not found)
file keyword_counts.csv saved
.
. * Optional: generate summary statistics
. egen total_count = rowtotal(*_count)
. list filename total_count
+-----------------------------------+
| filename total_~t |
|-----------------------------------|
.
end of do-file