This is a separate topic. This article covers three aspects: preliminary preparation, how to execute, and how to implement. Preliminary preparation: some time ago, an expert used a crawler to collect one day of Douyin short-video data, more than 20,000 records in total. With the data in hand, we clean it to extract the key points we want, namely audience labels: hobbies, concerns, and time points. After cleaning the 20,000 records, segmenting the video descriptions into words, counting word frequencies, and excluding invalid words, the high-frequency words were "like", "self", "really", "can", "tutorial", "hairstyle", "civic", "makeup", etc.
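The word-frequency step described above can be sketched roughly as follows. This is a minimal illustration, not the original author's script: the `top_words` function, the `STOPWORDS` set, and the sample descriptions are all hypothetical, and the regex tokenizer only works for space-separated text. Real Douyin descriptions are in Chinese and would need a proper word segmenter (for example the jieba library) in place of the regex.

```python
import re
from collections import Counter

# Hypothetical list of "invalid" words to exclude; a real list would be
# much longer and tuned to the data.
STOPWORDS = {"the", "a", "to", "of", "and", "my", "is"}


def top_words(descriptions, stopwords=STOPWORDS, n=5):
    """Count word frequencies across descriptions, skipping stopwords."""
    counts = Counter()
    for text in descriptions:
        # Naive tokenizer (assumption): lowercase runs of letters.
        tokens = re.findall(r"[a-z']+", text.lower())
        counts.update(t for t in tokens if t not in stopwords)
    return counts.most_common(n)


# Made-up sample data, for illustration only.
sample = [
    "Makeup tutorial for the weekend",
    "My new hairstyle tutorial",
    "A quick makeup tutorial",
]
print(top_words(sample, n=2))  # [('tutorial', 3), ('makeup', 2)]
```

Keeping the stopword list as a parameter makes it easy to rerun the count after each round of manual review, which matches the repeated cleaning the article describes.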
The high-frequency words suggest what can be done later. Next is the time period. The goal here is to measure user habits and see objectively in which time periods users are most active. Based on the posting times of the users in the list above, the number of likes and reposts in each time period is counted as a reference, which yields the following figure. This gives a more accurate picture of the best time periods, and the difference in effect between periods is clear: 13:00 and 18:00 are the peaks for likes. Finally, the distribution curve of likes is extracted from the same 20,000 records.
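The time-period statistics above amount to grouping videos by posting hour and summing their likes and reposts. The sketch below shows one way to do that, assuming each record is a `(posted_at, likes, reposts)` tuple with an ISO-format timestamp. The record layout, the function names `engagement_by_hour` and `peak_hours`, and the sample values are assumptions made for illustration, not the format of the crawled data.

```python
from collections import defaultdict
from datetime import datetime


def engagement_by_hour(records):
    """Sum likes and reposts per posting hour (0-23)."""
    totals = defaultdict(lambda: {"likes": 0, "reposts": 0, "videos": 0})
    for posted_at, likes, reposts in records:
        hour = datetime.fromisoformat(posted_at).hour
        totals[hour]["likes"] += likes
        totals[hour]["reposts"] += reposts
        totals[hour]["videos"] += 1
    return dict(totals)


def peak_hours(records, n=2):
    """Return the n hours with the most total likes, highest first."""
    totals = engagement_by_hour(records)
    ranked = sorted(totals, key=lambda h: totals[h]["likes"], reverse=True)
    return ranked[:n]


# Made-up sample records, for illustration only.
sample = [
    ("2018-05-01T13:05:00", 900, 40),
    ("2018-05-01T13:40:00", 600, 25),
    ("2018-05-01T18:10:00", 1200, 60),
    ("2018-05-01T09:30:00", 100, 5),
]
print(peak_hours(sample))  # [13, 18]
```

Ranking by total likes favors hours with many posts. Dividing by the `videos` count would rank hours by average likes per video instead, which may be the fairer comparison when posting volume varies a lot across the day.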
The rough distribution is that most short videos have fewer than 700 likes, and videos with tens of thousands of likes make up only a small proportion. This follows from how Douyin distributes content: based on the earlier algorithm of Toutiao (Today's Headlines), it measures the likes and reposts a video receives from its first 1,000 recommendations. If the repost and like rates are high at that point, the next batch of traffic is pushed to the video. So the likes earned within the first 1,000 recommendations are very important. After the tool-based pass, the largest part of the data cleaning is going through the data again by hand.
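Both numbers discussed above, the share of videos under 700 likes and the like-and-repost rate over the first 1,000 recommendations, are simple ratios. The sketch below computes them on made-up data. The function names are hypothetical, and since the article does not state what rate is "high" enough to trigger the next batch of traffic, no promotion threshold is modeled here.

```python
def share_below(likes, threshold=700):
    """Fraction of videos whose like count is below the threshold."""
    if not likes:
        return 0.0
    return sum(1 for x in likes if x < threshold) / len(likes)


def engagement_rate(likes, reposts, impressions=1000):
    """Likes plus reposts as a proportion of the first batch of recommendations."""
    return (likes + reposts) / impressions


# Made-up like counts, for illustration only.
sample_likes = [12, 150, 640, 699, 700, 35000, 80, 410]
print(share_below(sample_likes))   # 0.75
print(engagement_rate(80, 20))     # 0.1
```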