Getting data from the Twitter API
We use the popular twitter4j package to invoke the Twitter Search API, search for tweets, and save them to disk. The Twitter API requires authentication as of Version 1.1, so we will need to get authentication tokens and save them in the twitter4j.properties file before we get started.
Getting ready
If you don't have a Twitter account, go to twitter.com/signup and create one. You will also need to go to dev.twitter.com and sign in to enable your account for developer access. Once you have a Twitter login, we'll be on our way to creating the Twitter OAuth credentials. Be prepared for this process to differ from what we present here; in any case, we supply example results in the data directory. Let's now create the Twitter OAuth credentials:
- Log in to dev.twitter.com.
- Find the little pull-down menu next to your icon on the top bar.
- Choose My Applications.
- Click on Create a new application.
- Fill in the form and click on Create a Twitter application.
- The next page contains the OAuth settings.
- Click on the Create my access token link.
- You will need to copy Consumer key and Consumer secret.
- You will also need to copy Access token and Access token secret.
- These values should go into the twitter4j.properties file in the appropriate locations. The properties are as follows:

debug=false
oauth.consumerKey=ehUOExampleEwQLQpPQ
oauth.consumerSecret=aTHUGTBgExampleaW3yLvwdJYlhWY74
oauth.accessToken=1934528880-fiMQBJCBExamplegK6otBG3XXazLv
oauth.accessTokenSecret=y0XExampleGEHdhCQGcn46F8Vx2E
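If you would rather not keep credentials in a properties file, twitter4j can also accept them programmatically through its ConfigurationBuilder. This is a minimal sketch of our own, not code from the book; the placeholder strings must be replaced with your own credentials:

import twitter4j.Twitter;
import twitter4j.TwitterFactory;
import twitter4j.conf.ConfigurationBuilder;

public class ProgrammaticAuth {
    public static Twitter buildTwitter() {
        // The four set* calls mirror the oauth.* properties above;
        // replace the placeholder strings with your own credentials.
        ConfigurationBuilder cb = new ConfigurationBuilder()
            .setOAuthConsumerKey("yourConsumerKey")
            .setOAuthConsumerSecret("yourConsumerSecret")
            .setOAuthAccessToken("yourAccessToken")
            .setOAuthAccessTokenSecret("yourAccessTokenSecret");
        return new TwitterFactory(cb.build()).getInstance();
    }
}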
How to do it...
Now, we're ready to access Twitter and get some search data using the following steps:
- Go to the directory of this chapter and run the following command:
java -cp lingpipe-cookbook.1.0.jar:lib/twitter4j-core-4.0.1.jar:lib/opencsv-2.4.jar:lib/lingpipe-4.1.0.jar com.lingpipe.cookbook.chapter1.TwitterSearch
- The code displays the output file; in this case, a default value is used, but supplying a path as an argument will write to that file instead. Then, type in your query at the prompt:
Writing output to data/twitterSearch.csv
Enter Twitter Query:disney
- The code then queries Twitter and reports every 100 tweets found (output truncated):
Tweets Accumulated: 100
Tweets Accumulated: 200
…
Tweets Accumulated: 1500
writing to disk 1500 tweets at data/twitterSearch.csv
This program takes the search query, searches Twitter for the term, and writes the output (limited to 1500 tweets) to the .csv file name that you specified on the command line, or uses a default.
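For example, to write to a different file, pass the path as the first argument; the file name here is our own illustration:

java -cp lingpipe-cookbook.1.0.jar:lib/twitter4j-core-4.0.1.jar:lib/opencsv-2.4.jar:lib/lingpipe-4.1.0.jar com.lingpipe.cookbook.chapter1.TwitterSearch data/disneySearch.csv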
How it works...
The code uses the twitter4j library to instantiate TwitterFactory and searches Twitter using the user-entered query. The start of main() at src/com/lingpipe/cookbook/chapter1/TwitterSearch.java is:
String outFilePath = args.length > 0 ? args[0] : "data/twitterSearch.csv";
File outFile = new File(outFilePath);
System.out.println("Writing output to " + outFile);
BufferedReader reader = new BufferedReader(new InputStreamReader(System.in));
System.out.print("Enter Twitter Query:");
String queryString = reader.readLine();
The preceding code gets the output file, supplying a default if none is provided, and reads the query from standard input.
The following code sets up the query according to the vision of the twitter4j developers. For more information on this process, read their Javadoc; it should be fairly straightforward. To make our result set more unique, notice that when we create the query string, we filter out retweets using the -filter:retweets option. This is only somewhat effective; see the Eliminate near duplicates with the Jaccard distance recipe later in this chapter for a more complete solution:
Twitter twitter = new TwitterFactory().getInstance();
Query query = new Query(queryString + " -filter:retweets");
query.setLang("en"); // English
query.setCount(TWEETS_PER_PAGE);
query.setResultType(Query.RECENT);
Next, the following loop pages through the search results and accumulates the tweets:
List<String[]> csvRows = new ArrayList<String[]>();
while (csvRows.size() < MAX_TWEETS) {
    QueryResult result = twitter.search(query);
    List<Status> resultTweets = result.getTweets();
    for (Status tweetStatus : resultTweets) {
        String row[] = new String[Util.ROW_LENGTH];
        row[Util.TEXT_OFFSET] = tweetStatus.getText();
        csvRows.add(row);
    }
    System.out.println("Tweets Accumulated: " + csvRows.size());
    if ((query = result.nextQuery()) == null) {
        break;
    }
}
The preceding snippet is pretty standard code slinging, albeit without the usual hardening for external-facing code: try/catch, timeouts, and retries. One potentially confusing bit is the reassignment of query to handle paging through the search results; result.nextQuery() returns null when no more pages are available. The current Twitter API allows a maximum of 100 results per page, so in order to get 1500 results, we need to rerun the search until there are no more results, or until we have 1500 tweets.
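As a rough illustration of the hardening we skipped, the search call could be wrapped in a bounded retry with a simple backoff. This is a sketch of our own, not code from the book; searchWithRetry, MAX_RETRIES, and the sleep interval are placeholder choices:

import twitter4j.Query;
import twitter4j.QueryResult;
import twitter4j.Twitter;
import twitter4j.TwitterException;

public class RetryingSearch {
    // Placeholder value, not from the recipe's source.
    static final int MAX_RETRIES = 3;

    static QueryResult searchWithRetry(Twitter twitter, Query query)
            throws TwitterException, InterruptedException {
        for (int attempt = 1; ; attempt++) {
            try {
                return twitter.search(query);
            } catch (TwitterException e) {
                if (attempt >= MAX_RETRIES) {
                    throw e; // give up after repeated failures
                }
                // Crude linear backoff; a production client could also
                // consult e.getRateLimitStatus() when it is available.
                Thread.sleep(5000L * attempt);
            }
        }
    }
}

The next step involves a bit of reporting and writing: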
System.out.println("writing to disk " + csvRows.size() + " tweets at " + outFilePath); Util.writeCsvAddHeader(csvRows, outFile);
The list of tweets is then written to a .csv file using the Util.writeCsvAddHeader method:
public static void writeCsvAddHeader(List<String[]> data, File file) throws IOException {
    CSVWriter csvWriter = new CSVWriter(new OutputStreamWriter(new FileOutputStream(file), Strings.UTF8));
    csvWriter.writeNext(ANNOTATION_HEADER_ROW);
    csvWriter.writeAll(data);
    csvWriter.close();
}
We will be using this .csv file to run the language ID test in the next section.
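If you want to sanity-check the file before moving on, it can be read back with the same opencsv library. This is a minimal sketch of our own; it assumes the default output path and that the cookbook's Util class (for TEXT_OFFSET) is on the classpath:

import java.io.FileInputStream;
import java.io.InputStreamReader;
import java.util.List;
import au.com.bytecode.opencsv.CSVReader;

public class ReadTweetsBack {
    public static void main(String[] args) throws Exception {
        CSVReader csvReader = new CSVReader(new InputStreamReader(
            new FileInputStream("data/twitterSearch.csv"), "UTF-8"));
        List<String[]> rows = csvReader.readAll();
        csvReader.close();
        // The first row is the annotation header written above, so
        // start at index 1 and print the tweet text column.
        for (String[] row : rows.subList(1, rows.size())) {
            System.out.println(row[Util.TEXT_OFFSET]);
        }
    }
}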
See also
For more details on using the Twitter API and twitter4j, please see their documentation pages at dev.twitter.com and twitter4j.org.