/Release/v0.4
[525f1ca] - Enable changelog generation and integrate the log content into the release description (samsong 10:50)
___ ____ ____ ____ ____ ®
/__ / ____/ / ____/ Stata 18.0
___/ / /___/ / /___/ MP—Parallel Edition
Statistics and Data Science Copyright 1985-2023 StataCorp LLC
StataCorp
4905 Lakeway Drive
College Station, Texas 77845 USA
800-782-8272 https://www.stata.com
979-696-4600 service@stata.com
Stata license: Single-user 2-core , expiring 13 Feb 2026
Serial number: 501809366391
Licensed to: sam
NJAU
Notes:
1. Stata is running in batch mode.
2. Unicode is supported; see help unicode_advice.
3. More than 2 billion observations are allowed; see help obs_advice.
4. Maximum number of variables is set to 5,000 but can be increased;
see help set_maxvar.
. do word_freq_count.do
. set maxvar 120000
.
. * Define the keyword list
. global keywords "人工智能 机器学习 深度学习 神经网络 自然语言处理 计算机视觉
> 强化学习 大数据 算法 模型"
.
. * Get all text files in the current directory
. local files : dir . files "*.txt"
.
. * Create the results dataset
. clear
. set obs 1
Number of observations (_N) was 0, now 1.
. gen filename = ""
(1 missing value generated)
. foreach word in $keywords {
2. gen `word'_count = 0
3. }
.
. * Loop over each file
. local row = 1
. foreach file of local files {
2. * Read the file contents
. cap file close myfile
3. file open myfile using "`file'", read text
4. file read myfile line
5.
. local content = ""
6. while r(eof) == 0 {
7. local content = `"`content' `line'"'
8. file read myfile line
9. }
10. file close myfile
11.
. * Count the occurrences of each keyword
. set obs `row'
12. replace filename = "`file'" in `row'
13.
. foreach word in $keywords {
14. * Count keyword occurrences using regular expressions
. local count = 0
15. local temp_content = `"`content'"'
16.
. while regexm(`"`temp_content'"', "`word'") {
17. local count = `count' + 1
18. local temp_content = regexr(`"`temp_content'"', "`word'", "")
19. }
20.
. replace `word'_count = `count' in `row'
21. }
22.
. local row = `row' + 1
23. }
Number of observations (_N) was 1, now 1.
variable filename was str1 now str22
(1 real change made)
(1 real change made)
(0 real changes made)
(0 real changes made)
(0 real changes made)
(0 real changes made)
(0 real changes made)
(0 real changes made)
(1 real change made)
(1 real change made)
(1 real change made)
Number of observations (_N) was 1, now 2.
(1 real change made)
(1 real change made)
(1 real change made)
(1 real change made)
(1 real change made)
(1 real change made)
(1 real change made)
(1 real change made)
(1 real change made)
(1 real change made)
(1 real change made)
.
. * Display the results
. list filename *_count
+--------------------------------------------------------------------+
1. | filename | 人工智~t | 机器学~t | 深度学~t | 神经网~t |
| MinerU_601398_2021.txt | 4 | 0 | 0 | 0 |
|--------------------------------------------------------------------|
| 自然语~t | 计算机~t | 强化学~t | 大数据~t | 算法_c~t | 模型_c~t |
| 0 | 0 | 0 | 6 | 1 | 60 |
+--------------------------------------------------------------------+
+--------------------------------------------------------------------+
2. | filename | 人工智~t | 机器学~t | 深度学~t | 神经网~t |
| MinerU_601398_2022.txt | 4 | 0 | 0 | 0 |
|--------------------------------------------------------------------|
| 自然语~t | 计算机~t | 强化学~t | 大数据~t | 算法_c~t | 模型_c~t |
| 0 | 0 | 0 | 9 | 0 | 63 |
+--------------------------------------------------------------------+
.
. * Optional: save the results to a CSV file
. export delimited using "keyword_counts.csv", replace
(file keyword_counts.csv not found)
file keyword_counts.csv saved
.
. * Optional: generate summary statistics
. egen total_count = rowtotal(*_count)
. list filename total_count
+-----------------------------------+
| filename total_~t |
|-----------------------------------|
1. | MinerU_601398_2021.txt 71 |
2. | MinerU_601398_2022.txt 76 |
+-----------------------------------+
.
end of do-file
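
The counting loop in the do-file above removes one match per pass with regexr() and increments a counter. A minimal sketch of an alternative follows. It assumes the keywords are plain literals with no regex metacharacters and that non-overlapping matches are wanted. It deletes every occurrence at once and divides the change in length by the keyword length. The sample text and the macro name `stripped` are hypothetical and are not taken from the run above.

```stata
* Sketch: count non-overlapping literal occurrences of one keyword.
* The values of `content' and `word' are illustrative, not data from the log.
local content "大数据 模型 与 模型 训练"
local word "模型"

* usubinstr(s, old, new, .) replaces every occurrence of old in s;
* ustrlen() returns the length in Unicode characters rather than bytes.
local stripped = usubinstr(`"`content'"', "`word'", "", .)
local count = (ustrlen(`"`content'"') - ustrlen(`"`stripped'"')) / ustrlen("`word'")

display `count'
```

Inside the original loop, the same two lines could stand in for the inner `while regexm(...)` block, with `replace` then writing `count` into the matching `_count` variable as before.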