Getting data from the Twitter API
We use the popular twitter4j package to invoke the Twitter Search API, search for tweets, and save them to disk. The Twitter API requires authentication as of Version 1.1, so we will need to get authentication tokens and save them in the twitter4j.properties file before we get started.
Getting ready
If you don't have a Twitter account, go to twitter.com/signup and create one. You will also need to go to dev.twitter.com and sign in to enable your account for developer access. Once you have a Twitter login, we'll be on our way to creating the Twitter OAuth credentials. Be prepared for this process to differ from what we present here; in any case, we supply example results in the data directory. Let's now create the Twitter OAuth credentials:
- Log in to dev.twitter.com.
- Find the little pull-down menu next to your icon on the top bar.
- Choose My Applications.
- Click on Create a new application.
- Fill in the form and click on Create a Twitter application.
- The next page contains the OAuth settings.
- Click on the Create my access token link.
- You will need to copy Consumer key and Consumer secret.
- You will also need to copy Access token and Access token secret.
- These values should go into the twitter4j.properties file in the appropriate locations. The properties are as follows:

debug=false
oauth.consumerKey=ehUOExampleEwQLQpPQ
oauth.consumerSecret=aTHUGTBgExampleaW3yLvwdJYlhWY74
oauth.accessToken=1934528880-fiMQBJCBExamplegK6otBG3XXazLv
oauth.accessTokenSecret=y0XExampleGEHdhCQGcn46F8Vx2E
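If you would rather not keep credentials in a properties file, twitter4j can also accept them programmatically through its ConfigurationBuilder. This is a minimal sketch of our own, not code from the book; the placeholder strings must be replaced with your own credentials:

import twitter4j.Twitter;
import twitter4j.TwitterFactory;
import twitter4j.conf.ConfigurationBuilder;

public class ProgrammaticAuth {
    public static Twitter buildTwitter() {
        // The four set* calls mirror the oauth.* properties above;
        // replace the placeholder strings with your own credentials.
        ConfigurationBuilder cb = new ConfigurationBuilder()
            .setOAuthConsumerKey("yourConsumerKey")
            .setOAuthConsumerSecret("yourConsumerSecret")
            .setOAuthAccessToken("yourAccessToken")
            .setOAuthAccessTokenSecret("yourAccessTokenSecret");
        return new TwitterFactory(cb.build()).getInstance();
    }
}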
How to do it...
Now, we're ready to access Twitter and get some search data using the following steps:
- Go to the directory of this chapter and run the following command:
java -cp lingpipe-cookbook.1.0.jar:lib/twitter4j-core-4.0.1.jar:lib/opencsv-2.4.jar:lib/lingpipe-4.1.0.jar com.lingpipe.cookbook.chapter1.TwitterSearch
- The code displays the output file; in this case, a default value is used, but supplying a path as an argument will write to that file instead. Then, type in your query at the prompt:
Writing output to data/twitterSearch.csv
Enter Twitter Query:disney
- The code then queries Twitter and reports every 100 tweets found (output truncated):
Tweets Accumulated: 100
Tweets Accumulated: 200
…
Tweets Accumulated: 1500
writing to disk 1500 tweets at data/twitterSearch.csv
This program takes the search query, searches Twitter for the term, and writes the output (limited to 1500 tweets) to the .csv file name that you specified on the command line, or uses a default.
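For example, to write to a different file, pass the path as the first argument; the file name here is our own illustration:

java -cp lingpipe-cookbook.1.0.jar:lib/twitter4j-core-4.0.1.jar:lib/opencsv-2.4.jar:lib/lingpipe-4.1.0.jar com.lingpipe.cookbook.chapter1.TwitterSearch data/disneySearch.csv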
How it works...
The code uses the twitter4j library to instantiate TwitterFactory and searches Twitter using the user-entered query. The start of main() at src/com/lingpipe/cookbook/chapter1/TwitterSearch.java is:
String outFilePath = args.length > 0 ? args[0] : "data/twitterSearch.csv";
File outFile = new File(outFilePath);
System.out.println("Writing output to " + outFile);
BufferedReader reader = new BufferedReader(new InputStreamReader(System.in));
System.out.print("Enter Twitter Query:");
String queryString = reader.readLine();
The preceding code gets the output file, supplying a default if none is provided, and reads the query from standard input.
The following code sets up the query according to the vision of the twitter4j developers. For more information on this process, read their Javadoc; it should be fairly straightforward. To make our result set more unique, notice that when we create the query string, we filter out retweets using the -filter:retweets option. This is only somewhat effective; see the Eliminate near duplicates with the Jaccard distance recipe later in this chapter for a more complete solution:
Twitter twitter = new TwitterFactory().getInstance();
Query query = new Query(queryString + " -filter:retweets");
query.setLang("en"); // English
query.setCount(TWEETS_PER_PAGE);
query.setResultType(Query.RECENT);
Next, the following loop pages through the search results and accumulates the tweets:
List<String[]> csvRows = new ArrayList<String[]>();
while (csvRows.size() < MAX_TWEETS) {
    QueryResult result = twitter.search(query);
    List<Status> resultTweets = result.getTweets();
    for (Status tweetStatus : resultTweets) {
        String row[] = new String[Util.ROW_LENGTH];
        row[Util.TEXT_OFFSET] = tweetStatus.getText();
        csvRows.add(row);
    }
    System.out.println("Tweets Accumulated: " + csvRows.size());
    if ((query = result.nextQuery()) == null) {
        break;
    }
}
The preceding snippet is pretty standard code slinging, albeit without the usual hardening for external-facing code: try/catch, timeouts, and retries. One potentially confusing bit is the reassignment of query to handle paging through the search results; result.nextQuery() returns null when no more pages are available. The current Twitter API allows a maximum of 100 results per page, so in order to get 1500 results, we need to rerun the search until there are no more results, or until we have 1500 tweets.
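As a rough illustration of the hardening we skipped, the search call could be wrapped in a bounded retry with a simple backoff. This is a sketch of our own, not code from the book; searchWithRetry, MAX_RETRIES, and the sleep interval are placeholder choices:

import twitter4j.Query;
import twitter4j.QueryResult;
import twitter4j.Twitter;
import twitter4j.TwitterException;

public class RetryingSearch {
    // Placeholder value, not from the recipe's source.
    static final int MAX_RETRIES = 3;

    static QueryResult searchWithRetry(Twitter twitter, Query query)
            throws TwitterException, InterruptedException {
        for (int attempt = 1; ; attempt++) {
            try {
                return twitter.search(query);
            } catch (TwitterException e) {
                if (attempt >= MAX_RETRIES) {
                    throw e; // give up after repeated failures
                }
                // Crude linear backoff; a production client could also
                // consult e.getRateLimitStatus() when it is available.
                Thread.sleep(5000L * attempt);
            }
        }
    }
}

The next step involves a bit of reporting and writing: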
System.out.println("writing to disk " + csvRows.size() + " tweets at " + outFilePath); Util.writeCsvAddHeader(csvRows, outFile);
The list of tweets is then written to a .csv file using the Util.writeCsvAddHeader method:
public static void writeCsvAddHeader(List<String[]> data, File file) throws IOException {
    CSVWriter csvWriter = new CSVWriter(new OutputStreamWriter(new FileOutputStream(file), Strings.UTF8));
    csvWriter.writeNext(ANNOTATION_HEADER_ROW);
    csvWriter.writeAll(data);
    csvWriter.close();
}
We will be using this .csv file to run the language ID test in the next section.
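If you want to sanity-check the file before moving on, it can be read back with the same opencsv library. This is a minimal sketch of our own; it assumes the default output path and that the cookbook's Util class (for TEXT_OFFSET) is on the classpath:

import java.io.FileInputStream;
import java.io.InputStreamReader;
import java.util.List;
import au.com.bytecode.opencsv.CSVReader;

public class ReadTweetsBack {
    public static void main(String[] args) throws Exception {
        CSVReader csvReader = new CSVReader(new InputStreamReader(
            new FileInputStream("data/twitterSearch.csv"), "UTF-8"));
        List<String[]> rows = csvReader.readAll();
        csvReader.close();
        // The first row is the annotation header written above, so
        // start at index 1 and print the tweet text column.
        for (String[] row : rows.subList(1, rows.size())) {
            System.out.println(row[Util.TEXT_OFFSET]);
        }
    }
}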
See also
For more details on using the Twitter API and twitter4j, please see their documentation pages at dev.twitter.com and twitter4j.org.